Turbo Team, Together.AI
Mar 2024 - Present
Advisors: Ben Athiwaratkun (Senior Research Scientist, Together.AI) · Shuaiwen Song (Vice President of Research, Together.AI)
Efficient ML Algorithms
Ladder-Residual
Motivation. Large-model inference under tensor parallelism often suffers from communication stalls and weak overlap between communication and computation; we sought an architecture-runtime co-design that improves throughput without sacrificing model quality.
Contributions.
- Co-conceived the parallelism-aware residual design and helped shape the paper's system and evaluation story.
- Implemented and optimized the gpt-fast inference path with CUDA Graphs and PyTorch compile ("reduce-overhead") for large-model serving.
- Benchmarked performance across model scales (1B-405B) and TP world sizes (1, 2, 4, 8, 16), validating up to 30% end-to-end throughput improvement on 70B models with P2P enabled and up to 60% with P2P disabled.
CREST (Turbo-reasoning)
Motivation. Reasoning models often under-think or over-think at test time, wasting tokens or missing correct solutions; we sought a training-free intervention that could be deployed in mainstream serving stacks.
Contributions.
- Co-developed the core idea of a training-free test-time steering method that identifies and modulates cognitive attention heads, improving accuracy by up to 17.5% and reducing token usage by 37.6% across reasoning benchmarks.
- Designed deployment paths for integrating CREST into vLLM and SGLang.
CARE
Motivation. MLA-style attention can improve serving efficiency, but most pretrained checkpoints use GQA/MHA and cannot directly benefit; we sought a practical conversion path that preserves quality while lowering inference cost.
Contributions.
- Developed the core idea and empirical framing for upgrading pretrained attention into MLA-compatible forms.
- Proposed a conversion pipeline that upgrades pretrained attention (e.g. GQA) into multi-head latent attention (MLA) for faster inference without increasing KV-cache size.
- Ran the full experimental suite and carried out vLLM integration and theoretical analysis.
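The "without increasing KV-cache size" constraint can be illustrated with simple per-token arithmetic. This is a minimal sketch, not the CARE pipeline: it assumes GQA caches one key and one value vector per KV head, and that an MLA-style layer caches one shared latent vector plus a decoupled RoPE key. The function names and the example configuration (8 KV heads, head width 128, RoPE width 64) are hypothetical.

```python
def gqa_kv_elems_per_token(n_kv_heads: int, head_dim: int) -> int:
    """Cached elements per token per layer for GQA: one K and one V per KV head."""
    return 2 * n_kv_heads * head_dim


def mla_kv_elems_per_token(latent_dim: int, rope_dim: int) -> int:
    """Cached elements per token per layer for MLA (assumed layout):
    a shared compressed latent plus a decoupled RoPE key."""
    return latent_dim + rope_dim


def max_latent_dim(n_kv_heads: int, head_dim: int, rope_dim: int) -> int:
    """Largest latent width whose MLA cache does not exceed the GQA cache."""
    budget = gqa_kv_elems_per_token(n_kv_heads, head_dim)
    return max(budget - rope_dim, 0)


# Hypothetical GQA configuration, for illustration only.
budget = gqa_kv_elems_per_token(n_kv_heads=8, head_dim=128)
latent = max_latent_dim(n_kv_heads=8, head_dim=128, rope_dim=64)
print(budget, latent, mla_kv_elems_per_token(latent, 64))
```

Any latent width at or below `max_latent_dim` keeps the converted layer within the original cache budget under these assumptions.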
SQUEEZE THINK
Motivation. Recursive self-aggregation improves reasoning quality, but uniform compute allocation across generation and aggregation wastes cost on easy subsets and under-allocates recovery on hard subsets.
Contributions.
- Helped develop a multi-model orchestration view of recursive self-aggregation, routing generation and aggregation between large and small models based on cross-model confidence.
- Owned coding-benchmark execution and evaluation pipelines, especially for LiveCodeBench V6, and supported ablations on routing thresholds and aggregation behavior across AIME 2025 and HMMT 2025.
- Demonstrated 30-40% compute reduction at matched accuracy or 5-7 point accuracy gains at equivalent compute.
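The confidence-based routing described above might look roughly like the sketch below. It is illustrative only: the `Candidate` fields, the use of the minimum of the two scores as "cross-model confidence", and the 0.8 threshold are all assumptions, not the method's actual definitions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    answer: str
    small_conf: float  # hypothetical: small model's confidence in its own answer
    large_conf: float  # hypothetical: large model's confidence in that same answer


def route(c: Candidate, threshold: float = 0.8) -> str:
    """Keep the step on the small model only when both models are confident;
    otherwise escalate to the large model."""
    cross_conf = min(c.small_conf, c.large_conf)  # assumed combination rule
    return "small" if cross_conf >= threshold else "large"


easy = Candidate(answer="42", small_conf=0.93, large_conf=0.88)
hard = Candidate(answer="17", small_conf=0.91, large_conf=0.35)
print(route(easy), route(hard))
```

Sweeping `threshold` is one way to trade compute against accuracy, in the spirit of the routing-threshold ablations mentioned above.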
Agent Evolve
Motivation. Current LLM-based multiagent systems are largely static after deployment and lack mechanisms for continual adaptation across agents, skills, and populations.
Contributions.
- Built a bio-inspired LLM multiagent framework with pheromone-style memory, evolutionary division of labor, and skill inheritance for open-ended population adaptation.
- Studied population-level adaptation through competition, selection, and cross-generation strategy transfer.
- Explored integration of LEXICO compression techniques.
- Prototyped vocabulary-pruned speculators and Mix-Architecture Speculator designs.
- Explored diffusion LLMs that interleave self-verification with token generation.
- Investigated diffusion-style MoE routers for smoother expert selection.
- Investigated diffusion-style speculator design.
Efficient ML Systems
Training System — XoRL (RL Training System), Axolotl (SFT Training System)
Motivation. Building an RL and SFT training stack for coding and reasoning agents required more than model fine-tuning: it needed an end-to-end system that coupled sandboxed environments, distributed rollout workers, and multi-node training plus serving infrastructure while staying stable under long-context, MoE, and rapidly changing model variants.
Contributions.
- Built much of the training-side RL framework, including agent PPO trainers, asynchronous rollout and pipeline-training paths, and the execution flow that converts multi-turn agent-environment interaction into PPO and GRPO training batches.
- Owned the training pipeline that ingests rollout trajectories, computes advantage, and performs policy updates plus rollout-model weight synchronization for coding-agent post-training.
- Implemented asynchronous rollout, replay-queue mini-batching, and router-assisted batching between rollout and training workers to overlap trajectory generation with policy optimization.
- Developed trajectory/data transforms, token-level loss masks, stepwise-vs-trajectory advantage handling, rejection sampling, and batch balancing to improve GRPO signal quality and training stability.
- Scaled long-context training recipes to 16K-32K contexts using Ulysses sequence parallelism, remove-padding, chunked prefill, and per-GPU token-budget tuning for DeepCoder and DeepScaleR-style runs.
- Implemented sequence-parallel (SP) compatibility across the training stack so long-context post-training paths worked correctly with distributed attention, packed sequences, and rollout-to-training data flow.
- Built SP-compatible MoE-LoRA kernel paths to support efficient distributed post-training for expert models without breaking sequence-parallel execution.
- Integrated QuACK fused kernels into XoRL to improve kernel efficiency and support higher-throughput post-training recipes.
- Added Qwen3.5 support and completed model bring-up across configs, training paths, and distributed recipes for reliable experimentation.
- Diagnosed and fixed multi-node training failures (position_ids, cu_seqlens, attention-mask, and MoE dispatch issues) that destabilized distributed recipes across evolving model families.
- Integrated long-context attention (Ulysses, Ring Attention) into Axolotl and supported SFT data flow from successful trajectories to extend supervised post-training to larger context windows.
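As a rough illustration of trajectory-level advantage handling with token-level loss masks, the sketch below computes group-relative (GRPO-style) advantages for rollouts of one prompt and broadcasts each advantage over the trainable tokens. It is not XoRL code: the function names are hypothetical, and the use of population standard deviation with a small `eps` is one common choice among several.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (same prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some stacks use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage, loss_mask):
    """Trajectory-level advantage applied per token; masked tokens
    (e.g. environment observations) contribute zero."""
    return [advantage * m for m in loss_mask]


rewards = [1.0, 0.0, 0.0, 1.0]        # hypothetical pass/fail rewards for 4 rollouts
advs = group_relative_advantages(rewards)
mask = [0, 1, 1, 0]                    # 1 = model-generated token, 0 = masked out
print([broadcast_to_tokens(a, mask) for a in advs])
```

When every rollout in a group receives the same reward, all advantages are zero and the group carries no gradient signal, which is one reason filtering such groups (for example via rejection sampling) can help.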
Inference System — Pulsar & SGLang
Motivation. High-throughput serving requires lower KV overhead and more stable speculative decoding across cache-hit patterns, batch sizes, and multi-node deployments.
Contributions.
- Applied a Swift-KV caching strategy to accelerate prefill by reducing KV memory overhead and improving end-to-end latency.
- Designed and implemented KV-cache prompt caching for the Phoenix speculator in Pulsar, stabilizing acceptance rates and reducing end-to-end latency.
- Resolved tokenizer chat-template issues and Docker deployment bugs for reliable multi-node operation, then benchmarked cache behavior across batch sizes and cache-hit scenarios to explain acceptance-rate variability and optimize cache-hit logic.
- Implemented and integrated sliding-window attention support for Llama 4.
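For reference, a generic causal sliding-window mask can be sketched as follows. This shows the general technique only, under the assumption that each query attends to itself and the `window - 1` preceding positions; it does not reproduce the Pulsar or SGLang implementation, and the function name is hypothetical.

```python
def sliding_window_mask(seq_len: int, window: int):
    """mask[i][j] is True when query position i may attend to key position j:
    j must not be in the future and must lie within the last `window` positions."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


for row in sliding_window_mask(seq_len=5, window=3):
    print("".join("x" if allowed else "." for allowed in row))
```

Because each query sees at most `window` keys, the KV entries older than the window can be dropped during decoding, which is what bounds cache growth for such layers.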
AgentGo
Motivation. Tool-using agents alternate between long-context reasoning and external actions, but request-centric runtimes either evict useful KV state too early or waste memory by pinning it too long.
Contributions.
- Co-developed the core idea of treating multi-turn agent workflows as first-class programs rather than isolated requests.
- Helped build the staged system path from telemetry and shadow prediction to offline replay, observability, and config-gated runtime integration for prediction-aware scheduling.
Hierarchical Performance Isolation for Distributed LLM
Motivation. Multi-tenant LLM serving needs hierarchical fairness and performance isolation across shared instances and clusters without sacrificing throughput.
Contributions.
- Contributed to design discussions around hierarchical fairness, vruntime-style accounting, and weight partitioning across distributed serving instances.
- Participated in experiments evaluating performance isolation and fairness under multi-tenant LLM serving workloads.
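The vruntime-style accounting mentioned above can be illustrated with a single-level weighted fair scheduler. This is a simplified sketch, not the system's design: it ignores the hierarchical and cross-instance aspects, and the class name, tenant names, and weights are hypothetical.

```python
import heapq


class FairScheduler:
    """Serve the tenant with the smallest virtual runtime; charging a tenant
    advances its vruntime by work / weight, so heavier weights get more service."""

    def __init__(self, weights):
        self.weights = dict(weights)
        self.heap = [(0.0, name) for name in sorted(self.weights)]
        heapq.heapify(self.heap)

    def charge(self, tokens: float) -> str:
        """Pick the next tenant, bill it for `tokens` of work, and return its name."""
        vruntime, name = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (vruntime + tokens / self.weights[name], name))
        return name


sched = FairScheduler({"tenant_a": 2.0, "tenant_b": 1.0})  # hypothetical tenants
served = [sched.charge(tokens=100) for _ in range(6)]
print(served)
```

With equal-sized requests, service converges to the weight ratio; a hierarchical variant would apply the same accounting at each level of the tenant tree.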
Modeling
CoderForge
Motivation. High-quality coding agents require strong trajectory data, stable post-training pipelines, and task-aligned optimization objectives for code generation.
Contributions.
- Led the training pipeline for OpenHands R2E-Gym & SWE-Bench-scale data: curated high-signal SWE-smith / Rebench examples and fixed attention-mask plus position-ID issues in XoRL.
- Distilled Qwen3-480B trajectories into a 30B coding model via supervised fine-tuning and activation distillation, then initiated MoE / RL scaling for Qwen3-30B to improve SWE-Bench solve rates.
- Designed per-token loss formulations for coding-trajectory distillation and model-quality improvement.
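One possible shape of a per-token distillation loss is sketched below: a forward KL divergence between teacher and student next-token distributions, averaged over unmasked tokens only. This is an assumption-laden illustration, not the formulation used in CoderForge; the function names and toy distributions are hypothetical, and student probabilities are assumed to be strictly positive.

```python
import math


def token_kl(teacher_probs, student_probs):
    """Forward KL(teacher || student) for one token position."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0  # terms with zero teacher mass contribute nothing
    )


def masked_mean_loss(per_token_losses, loss_mask):
    """Average the loss over tokens whose mask is set (e.g. assistant tokens)."""
    kept = [loss for loss, m in zip(per_token_losses, loss_mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]   # toy 3-word vocabulary, 2 positions
student = [[0.6, 0.3, 0.1], [0.3, 0.4, 0.3]]
losses = [token_kl(t, s) for t, s in zip(teacher, student)]
print(masked_mean_loss(losses, loss_mask=[1, 1]))
```

Normalizing by the number of unmasked tokens, rather than by sequence count, keeps long trajectories from being under-weighted relative to short ones.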