May 20, 2026 EN #LLM Inference #LLM Serving #Pipeline Parallelism

Sarathi-Serve: Taming the Throughput–Latency Tradeoff in LLM Inference — Technical Review

Review date: 2026-05-20
Review author: Zhongzhu Zhou
Paper reviewed: Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Paper authors: Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee (Microsoft Research India + Georgia Tech)
arXiv: 2403.02310v3, 2024-06-17
Venue: USENIX OSDI 2024

Short answer

If you have looked into how production LLM serving stacks actually batch tokens — vLLM, TensorRT-LLM, SGLang, NVIDIA Triton — you have almost certainly seen the phrase chunked prefill. That phrase comes from this paper.

Before Sarathi-Serve, the standard story for “fast LLM serving” was: use iteration-level batching (Orca, 2022), keep batches as large as memory permits, and admit new requests eagerly so subsequent decodes run at high batch size. That story is what gave us vLLM. It is also the story that produces the multi-second generation stalls you can see in Figure 1a of the paper — every time a new prompt joins, the GPU pauses all on-going decodes to do a giant prefill.

Sarathi-Serve’s diagnosis is that the bottleneck is not memory and not compute — it is the batching policy. Two ideas suffice:

Chunked-prefills. Don’t run a prefill in a single iteration. Slice the prompt into fixed-size chunks (say, 512 tokens each) and feed one chunk per iteration. This turns one heavy iteration into several light, predictable iterations.
Stall-free scheduling. Inside one iteration, pack on-going decodes together with one prefill chunk into a single forward pass — what the paper calls a hybrid batch — provided the total token count stays under a token budget $\tau$ chosen so the iteration finishes inside the TBT (time-between-tokens) SLO.

The empirical result is striking. For Mistral-7B on one A100, capacity rises 2.6× at the same tail-latency target. For Yi-34B on two A100s, 3.7×. For Falcon-180B with TP4×PP2 over a 100 Gbps Ethernet rack, the gain reaches 5.6× — because chunked, near-uniform batches also kill pipeline-parallel bubbles. The paper is also unusually clean: only two parameters (chunk size and token budget) and a single scheduling algorithm explain all of it.

This note (a) builds the prerequisites for readers who only know “vLLM is fast for LLMs,” (b) walks through chunked-prefill arithmetic carefully, (c) summarizes the experiments and the few caveats that I think matter, and (d) places Sarathi-Serve against the rest of the LLM-serving literature I have been reviewing on this blog (DistServe, Splitwise, KV-Fold, PipeSD).

1. Prerequisites

Sarathi-Serve sits at the intersection of three sub-fields: LLM architectures, request-level systems, and GPU performance modeling. Without the right backdrop the paper reads like a grab-bag of small tricks. With it, the paper reads like one careful observation followed by its inevitable consequence. I’ll sketch what you need.

1.1 The two phases of LLM inference

A decoder-only transformer generates text in two distinct phases:

Prefill. Given a prompt of length $L_p$, the model performs one forward pass that consumes all $L_p$ tokens in parallel and emits the first output token. Every layer of the model fires once, and every position attends (under the causal mask) to itself and all earlier positions in the prompt. This phase is compute-bound because for any non-trivial $L_p$ (say, 1024 tokens), the GPU has enough work to saturate its FP16 tensor cores.
Decode. After the first token, the model autoregressively emits the next token, then the next, one at a time. Each decode step is a forward pass with input length one. Almost all of the cost is fetching the model weights from HBM — the actual math is tiny. This phase is memory-bound and benefits dramatically from batching, because every additional concurrent decode in a batch amortizes the same weight-fetch over more arithmetic.

Figure 3 of the paper shows that decode throughput grows nearly linearly with batch size, while prefill throughput is essentially flat (Mistral-7B, A100, prompt length 1024). This contrast is the entire story of the paper: prefills and decodes have different bottlenecks, and any scheduler that pretends otherwise will lose either throughput or latency.

1.2 Iteration-level batching (Orca, 2022)

The pre-Orca world batched requests at request granularity. You picked $B$ requests, ran prefill on all of them, then ran decode on all of them, and the batch did not finish until the last request in it stopped emitting. This wasted GPU cycles whenever requests had different output lengths.

Orca’s contribution was iteration-level batching: at every iteration, the scheduler is free to add or remove requests. A request that just finished decoding can leave the batch. A new request that just arrived can join. This is what made high-throughput LLM serving viable, and every modern engine (vLLM, TensorRT-LLM, SGLang, MII) inherits it.
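To make the waste of request-level batching concrete, here is a minimal sketch that counts decode slot-iterations under the two policies. The output lengths are invented for illustration, and the function names are mine, not Orca's.

```python
def request_level_slot_iterations(output_lens):
    """Decode slots occupied when the batch runs until its longest request ends."""
    return len(output_lens) * max(output_lens)


def iteration_level_slot_iterations(output_lens):
    """Decode slots occupied when each request leaves as soon as it finishes."""
    return sum(output_lens)


# Hypothetical output lengths (in tokens) for a batch of three requests.
output_lens = [20, 50, 400]
held = request_level_slot_iterations(output_lens)
used = iteration_level_slot_iterations(output_lens)
print(f"request-level holds {held} slot-iterations, only {used} emit a token "
      f"({1 - used / held:.0%} idle)")
```

With iteration-level batching, the idle slots can be handed to newly arrived requests instead of waiting for the slowest request to finish.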

But Orca leaves one question open: when a new request arrives, do you prefill it immediately, or wait?

1.3 The vLLM batching policy

vLLM (Kwon et al., SOSP 2023) introduced PagedAttention — paging the KV-cache like an OS pages virtual memory, eliminating fragmentation and allowing far larger batch sizes. Its scheduler is prefill-prioritizing: as soon as KV-cache memory permits a new request, vLLM pauses the on-going decode batch, executes the full prefill of the new request, and only then resumes decoding. The rationale is sound: a larger decode batch is dramatically more efficient, so paying for the prefill quickly is worth it.

The problem is that the prefill of a long prompt (say, 8K tokens) can take seconds. During those seconds, every other user’s decode is paused. The paper calls this a generation stall, and Figure 1a is the smoking gun: in vLLM, you can see tokens being generated, then a flat plateau lasting a few seconds, then generation resuming. From a user’s perspective, the model “freezes” mid-sentence.

1.4 Time-to-first-token vs time-between-tokens

LLM serving exposes two distinct latencies that don’t track each other:

TTFT (time-to-first-token). From request arrival to the first output token. Dominated by queuing + prefill cost.
TBT (time-between-tokens). Latency between consecutive output tokens within one stream. Dominated by per-iteration cost during decoding.

A good user experience needs both TTFT and tail TBT inside their SLOs. The cruel thing about LLM serving is that fixing TTFT (prefill immediately) worsens TBT (because of generation stalls), and fixing TBT (decode-prioritizing, no admission) worsens TTFT (because new requests wait in the queue until the running batch drains). This is the throughput–latency tradeoff named in the paper’s title.
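The two metrics are easy to compute from per-token timestamps. The sketch below uses an invented trace (integer milliseconds) in which steady 50 ms decoding is interrupted by one two-second stall; the function name is hypothetical.

```python
def ttft_and_tbts(arrival_ms, token_times_ms):
    """Return (TTFT, list of TBTs) for one request, all in milliseconds."""
    ttft = token_times_ms[0] - arrival_ms
    tbts = [later - earlier
            for earlier, later in zip(token_times_ms, token_times_ms[1:])]
    return ttft, tbts


# Hypothetical trace: the fourth token arrives after a long generation stall.
ttft, tbts = ttft_and_tbts(arrival_ms=0, token_times_ms=[400, 450, 500, 2500, 2550])
print("TTFT:", ttft, "ms")
print("TBTs:", tbts, "ms, worst:", max(tbts), "ms")
```

A single stalled gap dominates the worst-case TBT even though most gaps are short, which is why the tail of the TBT distribution, rather than its mean, is the number to watch.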

1.5 Pipeline parallelism and bubbles

When a model is too big for one GPU even with tensor parallelism (TP), the natural fallback is pipeline parallelism (PP): split layers across stages, stream micro-batches through. PP works beautifully when micro-batches have uniform compute. But in LLM serving, micro-batches contain a mix of prefills and decodes with wildly different shapes, so the per-stage time fluctuates and bubbles form: a downstream stage finishes early and stalls waiting for the upstream stage’s next micro-batch.

PP bubbles are a different problem from generation stalls, but Sarathi-Serve’s solution attacks both simultaneously — once every micro-batch contains the same number of tokens, every stage takes about the same time, and bubbles shrink to a few percent.
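A toy simulation shows why uniform micro-batches matter. The model below is my own simplification, not the paper's: every stage spends the same time on a given micro-batch, as many micro-batches circulate as there are stages, and a micro-batch can re-enter stage 0 only after its previous pass has left the last stage (the autoregressive dependency). All timings are invented.

```python
def pipeline_idle_fraction(micro_batch_ms, num_stages):
    """Fraction of total stage-time left idle in a toy inference pipeline."""
    count = len(micro_batch_ms)
    finish = [[0.0] * count for _ in range(num_stages)]
    for m, cost in enumerate(micro_batch_ms):
        for s in range(num_stages):
            if s > 0:
                ready = finish[s - 1][m]            # output of the previous stage
            elif m >= num_stages:
                ready = finish[-1][m - num_stages]  # previous pass of the same slot
            else:
                ready = 0.0
            stage_free = finish[s][m - 1] if m > 0 else 0.0
            finish[s][m] = max(ready, stage_free) + cost
    makespan = finish[-1][-1]
    busy = num_stages * sum(micro_batch_ms)
    return 1.0 - busy / (num_stages * makespan)


uneven = [40, 2] * 4   # alternating prefill-heavy and decode-only micro-batches
uniform = [21] * 8     # the same total work, evenly shaped
print(f"uneven:  {pipeline_idle_fraction(uneven, 2):.0%} idle")
print(f"uniform: {pipeline_idle_fraction(uniform, 2):.0%} idle")
```

In this toy setup the uneven schedule leaves far more of the pipeline idle than the uniform one, whose only idle time is the fill and drain at the ends. The exact percentages depend entirely on the made-up timings.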

1.6 Arithmetic intensity and the compute-bound knee

A linear layer’s runtime can be approximated as $T = \max(T_{\mathrm{math}}, T_{\mathrm{mem}})$ . When you have $n$ input tokens going through a matmul with $d_\text{in} \times d_\text{out}$ weights:

$T_{\mathrm{math}} = \frac{2 \cdot n \cdot d_\text{in} \cdot d_\text{out}}{\mathrm{FLOPS}_\text{peak}}, \quad T_{\mathrm{mem}} = \frac{d_\text{in} \cdot d_\text{out} \cdot \mathrm{bytes}_\text{w} + n \cdot d_\text{in} \cdot \mathrm{bytes}_\text{a}}{\mathrm{BW}_\text{peak}}.$

For small $n$, $T_{\mathrm{mem}} > T_{\mathrm{math}}$ and the kernel is memory-bound: doubling $n$ barely changes $T$. For large $n$, $T_{\mathrm{math}} > T_{\mathrm{mem}}$ and the kernel is compute-bound: $T$ grows linearly in $n$. Setting $T_{\mathrm{math}} = T_{\mathrm{mem}}$ and neglecting the activation term (small next to the weight term when $n \ll d_\text{out}$) puts the transition at $n \approx \frac{\mathrm{FLOPS}_\text{peak}}{\mathrm{BW}_\text{peak}} \cdot \frac{\mathrm{bytes}_\text{w}}{2}$, which for A100 + FP16 sits around 128–512 tokens.

Figure 5 and Figure 6 of the paper plot exactly this curve. Pure-decode batches sit far below the knee. Pure-prefill batches sit far above it (wasting bandwidth). The whole intuition for chunked-prefill is to put the batch right on the knee, which is also the point where you simultaneously maximize MFU (FLOPs utilization) and MBU (bandwidth utilization).
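The roofline model above can be evaluated directly. This sketch solves $T_{\mathrm{math}} = T_{\mathrm{mem}}$ in closed form for one layer. The hardware numbers are assumed datasheet peaks for an A100 80GB (about 312 TFLOP/s dense FP16 and about 2.0 TB/s HBM), and the 4096-wide layer is an arbitrary choice; real kernels rarely reach these peaks, so a measured knee can differ.

```python
import math


def linear_layer_times(n, d_in, d_out, flops_peak, bw_peak, bytes_w=2, bytes_a=2):
    """Return (t_math, t_mem) in seconds for n tokens through a d_in x d_out matmul."""
    t_math = 2 * n * d_in * d_out / flops_peak
    t_mem = (d_in * d_out * bytes_w + n * d_in * bytes_a) / bw_peak
    return t_math, t_mem


def knee_tokens(d_out, flops_peak, bw_peak, bytes_w=2, bytes_a=2):
    """Token count at which t_math == t_mem (d_in cancels out of the equation)."""
    denominator = 2 * d_out * bw_peak - bytes_a * flops_peak
    if denominator <= 0:
        return math.inf  # layer too narrow to ever become compute-bound
    return d_out * bytes_w * flops_peak / denominator


# Assumed peaks, not measurements.
A100_FLOPS, A100_BW = 312e12, 2.0e12
knee = knee_tokens(d_out=4096, flops_peak=A100_FLOPS, bw_peak=A100_BW)
print(f"ideal knee for a 4096-wide FP16 layer: about {knee:.0f} tokens")
```

Below the computed knee the memory term dominates, and above it the math term does, which is the memory-bound versus compute-bound split described in this section.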

With those six pieces in mind, the rest of the paper is essentially “pack prefill chunks and decodes together until you hit the knee, but no further.”

2. Why current schedulers fail

The paper spends Sections 2–3 carefully cataloguing the failure modes of prior work. I’ll summarize and add my own take.

2.1 Decode-prioritizing schedulers (FasterTransformer, Triton, request-level batching)

Pattern: collect a batch, prefill all of it, decode all of it, never admit new requests mid-batch.

Failure mode: the batch shrinks as fast users finish, but slower users keep the GPU half-empty until the last request completes. Throughput is awful.

2.2 Prefill-prioritizing schedulers (Orca, vLLM, TensorRT-LLM defaults)

Pattern: iteration-level batching with an “admit on memory available” policy. Every time KV-cache opens up, run the next prefill before resuming decodes.

Failure modes are subtle and they compose:

Generation stalls (Section 3.2). When a new request arrives with a long prompt, the prefill iteration takes hundreds of milliseconds to seconds. During this iteration all on-going decodes pause. Worse, the longer the prompt, the longer the stall, so tail TBT scales with worst-case prompt length, not average.
Pipeline-parallel bubbles (Section 3.3). With PP and mixed prefill/decode iterations, each stage processes batches of different sizes, so cross-stage timing is uneven. Even if you size micro-batches carefully, the iteration-to-iteration variance is enough to leave 20–40% of the pipeline idle.
Sub-optimal arithmetic intensity. Pure-prefill iterations are above the knee: compute is the binding resource, so memory bandwidth sits idle. Pure-decode iterations are below the knee: bandwidth for the weight loads is the binding resource, so compute sits idle for so few tokens. Either way, you don’t sit at the optimum.

The key empirical claim in Section 3 is that no choice of micro-batch size can simultaneously fix all three. Generation stalls and pipeline bubbles trade off against each other inside the prefill-prioritizing design space. The fix has to break that design space.

2.3 Chunked-prefill is the design-space exit

If a prefill of $L_p$ tokens must complete in one iteration, you have to pay $L_p$ tokens’ worth of latency in that iteration. But if you allow the prefill to be split across multiple iterations of at most $C$ tokens each, each iteration’s prefill work is bounded by $C$, regardless of $L_p$. This is the single load-bearing observation of the paper. The cost is that prefill is no longer “one shot”, but the cost in extra attention work is small and analyzable (Section 4.1 of the paper).
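The bound can be illustrated with a deliberately crude cost model. The linear per-iteration cost below (a fixed overhead plus a per-token term) is an assumption of mine with invented coefficients; it ignores the growth of attention cost with context, so treat the numbers as a shape, not a prediction.

```python
def iteration_ms(tokens, fixed_ms=20.0, per_token_ms=0.05):
    """Hypothetical linear cost model for one forward pass (not measured)."""
    return fixed_ms + per_token_ms * tokens


def worst_gap_prefill_prioritizing(prompt_len, num_decodes):
    """A running decode waits out the whole prefill, then its own decode iteration."""
    return iteration_ms(prompt_len) + iteration_ms(num_decodes)


def worst_gap_chunked(prompt_len, num_decodes, chunk):
    """Decodes share each iteration with at most one chunk of the prompt."""
    return iteration_ms(num_decodes + min(chunk, prompt_len))


for prompt_len in (2048, 8192, 32768):
    stalled = worst_gap_prefill_prioritizing(prompt_len, num_decodes=32)
    chunked = worst_gap_chunked(prompt_len, num_decodes=32, chunk=512)
    print(f"L_p={prompt_len:>6}: one-shot gap {stalled:7.1f} ms, chunked gap {chunked:5.1f} ms")
```

Under this model the one-shot gap grows with the prompt length while the chunked gap stays constant once the prompt exceeds one chunk.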

3. Method

The Sarathi-Serve scheduler has two ideas. Both are simple.

3.1 Chunked-prefills (the core idea)

A prefill of length $L_p$ is split into $\lceil L_p / C \rceil$ chunks of size $C$ (the last chunk may be shorter). Chunk $i$ processes tokens $[(i-1)C, iC)$ and writes their KV entries. Chunk $i$’s attention computation must use the KV entries of all previous chunks of the same prompt, so its attention cost scales with $C$ times the cumulative prefix length $iC$, while its FFN/linear cost scales only with $C$.

The math is worth doing out loud. Say a prompt of length $L_p = 4096$ is split into 8 chunks of $C = 512$ . Total FFN cost is the same as one shot ( $L_p$ tokens). Total attention cost goes from $\Theta(L_p^2)$ (one shot) to $\Theta(\sum_{i=1}^{8} 512 \cdot 512 \cdot i)$ , which is also $\Theta(L_p^2)$ . So linear cost is unchanged and attention cost differs by at most a constant factor (the paper measures it at $<3\%$ extra at $C = 512$ , which I find believable given memory locality effects in flash-attention kernels).
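The same arithmetic can be checked by counting query-key pairs. The sum above counts full chunk-by-prefix blocks; if the causal mask is also applied inside each chunk, the chunked count matches the one-shot count exactly. What chunking does add is repeated loading of already-cached KV entries, which is my reading of where the measured overhead comes from. The function names are hypothetical.

```python
def causal_pairs_one_shot(prompt_len):
    """Query-key pairs under a causal mask: token t attends to t + 1 positions."""
    return prompt_len * (prompt_len + 1) // 2


def causal_pairs_chunked(prompt_len, chunk):
    """Same count when the prompt is processed chunk by chunk."""
    total = 0
    for start in range(0, prompt_len, chunk):
        size = min(chunk, prompt_len - start)
        # Every query in the chunk sees all `start` cached keys,
        # plus the causal triangle inside the chunk itself.
        total += size * start + size * (size + 1) // 2
    return total


def cached_kv_positions_reloaded(prompt_len, chunk):
    """KV positions read back from the cache, summed over all chunks."""
    return sum(range(0, prompt_len, chunk))


print(causal_pairs_one_shot(4096), causal_pairs_chunked(4096, 512))
print(cached_kv_positions_reloaded(4096, 512), "cached positions re-read vs 0 for one shot")
```

So the extra cost of chunking shows up as memory traffic and smaller kernels rather than as extra attention math.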

Choosing $C$ is a tradeoff:

Smaller $C$ → more iterations → more attention-recompute overhead → smaller per-iteration TBT.
Larger $C$ → fewer iterations → less recompute → larger per-iteration TBT.

For Mistral-7B on an A100, the sweet spot is around $C = 512$ ; for LLaMA2-70B with TP4, it shifts to $C \approx 1024$ because the per-token math is more amortized.

3.2 Stall-free hybrid batching

Within an iteration, Sarathi-Serve forms a hybrid batch that contains:

All currently-active decode requests (one token each), call that $B_d$ tokens total.
Some number of prefill-chunk tokens from possibly multiple new requests, call that $B_p$ tokens.

The constraint is $B_d + B_p \le \tau$ , where $\tau$ is the token budget. The scheduler greedily fills the budget: first admit decodes (they have to run anyway), then admit prefill chunks until the budget is exhausted or no prefill is queued.

This achieves the named property — stall-free scheduling: an on-going decode never has to wait for a new prefill to finish, because the new prefill is folded into the same iteration. The decode pays for the chunk’s compute, but only $C$ tokens worth of it, not $L_p$ worth.

It also turns every iteration into a near-uniform-shaped forward pass of $\tau$ tokens. Pipeline-parallel deployments love this: micro-batches are now the same shape across stages, bubbles collapse.

3.3 Choosing the token budget $\tau$

The token budget controls the iteration runtime. Going larger improves throughput (better arithmetic intensity, fewer iterations per prompt). Going smaller improves TBT (each iteration is faster).

The paper profiles the model on the target hardware once, sweeping $n$ tokens through the model and measuring the per-iteration cost. Given a TBT SLO of $T^*$, $\tau$ is the largest $n$ whose profiled cost is $\le T^*$. For A100 + Mistral-7B + $T^* = 100$ ms, $\tau$ lands around 1500–2000. For LLaMA2-70B on TP4 + 200 ms TBT, $\tau$ goes up into the thousands.
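The selection rule is a one-liner once the profile exists. In this sketch the profile is a plain dictionary with invented timings, and `pick_token_budget` is a hypothetical helper, not a function from the Sarathi-Serve codebase.

```python
def pick_token_budget(profile_ms, tbt_slo_ms):
    """Largest profiled token count whose iteration time fits the TBT SLO."""
    feasible = [tokens for tokens, cost_ms in profile_ms.items()
                if cost_ms <= tbt_slo_ms]
    if not feasible:
        raise ValueError("no profiled batch size meets the SLO")
    return max(feasible)


# Hypothetical profile: tokens per iteration -> measured iteration time in ms.
profile_ms = {256: 24.0, 512: 35.0, 1024: 58.0, 1536: 82.0, 2048: 104.0, 4096: 205.0}
print(pick_token_budget(profile_ms, tbt_slo_ms=100.0))
```

A finer sweep gives a tighter budget; the rule itself does not change.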

There is a beautiful side-effect: because $\tau$ is calibrated to the linear-layer knee, the same $\tau$ also maximizes MFU. In other words, the SLO-respecting choice of $\tau$ happens to be the throughput-maximizing one. The paper frames this as the central insight (Figure 5 shows the Sarathi-Serve operating point sitting right at the elbow), and I find this claim genuinely elegant.

3.4 The scheduling algorithm in full

Putting the pieces together (the paper’s Algorithm 3 is short and worth showing close to verbatim, in pseudocode):

loop:
    B ← empty batch
    # 1. Admit all currently active decodes (one token each)
    for r in active_decodes:
        B.add_decode(r)
    n ← number_of_tokens(B)
    # 2. Fill the remaining budget, at most one chunk per queued prompt (FIFO)
    for r in prefill_queue:
        if n >= tau:
            break
        remaining ← chunk_size_for(r)  # = C, or smaller for last chunk
        take ← min(remaining, tau - n)
        B.add_prefill_chunk(r, take)
        n ← n + take
    # 3. Execute one forward pass on B
    run_iteration(B)
    # 4. Prompts that are now fully prefilled start decoding
    for r in prefill_queue:
        if prefill_done(r):
            prefill_queue.remove(r)
            active_decodes.add(r)

There is no fancy optimization, no learned scheduler, no reinforcement learning. The whole policy is “fill the budget.”
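For readers who want to step through the policy, here is a runnable sketch of the fill-the-budget loop. It assumes each queued prompt receives at most one chunk per iteration, ignores KV-cache admission checks, and does not run a model; `Request` and `schedule_iteration` are illustrative names of mine, not the paper's or any library's API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens already written to the KV cache

    @property
    def remaining(self):
        return self.prompt_len - self.prefilled


def schedule_iteration(active_decodes, prefill_queue, tau, chunk):
    """Build one hybrid batch, apply it, and return [(name, kind, tokens), ...]."""
    # 1. Every active decode contributes exactly one token.
    batch = [(r.name, "decode", 1) for r in active_decodes]
    n = len(batch)
    # 2. Give each queued prompt at most one chunk, in FIFO order, within budget.
    for r in prefill_queue:
        if n >= tau:
            break
        take = min(chunk, r.remaining, tau - n)
        batch.append((r.name, "prefill", take))
        r.prefilled += take
        n += take
    # 3. (The forward pass would run here.) Finished prefills start decoding.
    for r in [r for r in prefill_queue if r.remaining == 0]:
        prefill_queue.remove(r)
        active_decodes.append(r)
    return batch


decodes = [Request(f"d{i}", prompt_len=4, prefilled=4) for i in range(3)]
queue = deque([Request("A", prompt_len=20), Request("B", prompt_len=5)])
for step in range(3):
    print(step, schedule_iteration(decodes, queue, tau=16, chunk=8))
```

In the toy run, the short prompt B finishes its prefill in the first iteration and joins the decodes, while the longer prompt A advances one chunk at a time without ever displacing a decode.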

3.5 What about admission control and KV-cache memory pressure?

Sarathi-Serve does not change PagedAttention’s memory management; the KV-cache management is orthogonal. Admission control (whether to accept a new request at all) is still policy-driven and configurable. The contribution is purely the inside-the-batch scheduling.

4. Evaluation

4.1 Setup

Models. Mistral-7B (single A100, no parallelism), Yi-34B (2× A100, TP2), LLaMA2-70B (8× A40, TP8 — but tested with hybrid configs), Falcon-180B (8× A100 across 2 nodes, TP4×PP2 over 100 Gbps Ethernet — a deliberately commodity setup to stress PP).
Workloads. Two production-style traces: openchat_sharegpt4 (chatbot, median prompt 1730 tokens) and arxiv_summarization (long-form, median prompt 7059 tokens). The traces matter — short-prompt and long-prompt regimes stress different parts of the scheduler.
Baselines. Stock vLLM (the obvious comparison), Orca (faithfully reimplemented because the original is not open), FasterTransformer (request-level baseline). For PP experiments, a hand-tuned vLLM-PP and a Megatron-style PP baseline.
SLOs. Strict (TBT P99 ≤ 200 ms) and relaxed (TBT P99 ≤ 500 ms) targets, parameterized per-model.

4.2 Headline capacity numbers

The headline numbers (Section 5.1 of the paper) are:

Mistral-7B, single A100. 2.6× higher RPS capacity under strict TBT than vLLM.
Yi-34B, TP2. 3.7× higher capacity under strict TBT than vLLM.
Falcon-180B, TP4×PP2 over Ethernet. 5.6× higher capacity under strict TBT than vLLM-PP.

“Capacity” here is the maximum sustained RPS at which P99 TBT stays under the SLO. The bigger gain on Falcon-180B is precisely because PP bubble reduction adds on top of the chunked-prefill benefit. The smaller gain on Mistral-7B is because a single-GPU stack has no bubbles to recover.
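That definition of capacity can be written down directly. The sketch below takes a load sweep of (RPS, P99 TBT) pairs and returns the highest load that still meets the SLO. Both sweeps are invented for illustration and are not the paper's measurements.

```python
def capacity_rps(sweep, tbt_slo_ms):
    """Highest offered load whose P99 TBT, and that of every lower load, meets the SLO.

    `sweep` is a list of (requests_per_second, p99_tbt_ms) measurements.
    """
    capacity = 0.0
    for rps, p99_ms in sorted(sweep):
        if p99_ms > tbt_slo_ms:
            break
        capacity = rps
    return capacity


# Hypothetical load sweeps for two schedulers.
baseline = [(0.5, 120), (1.0, 180), (1.5, 900), (2.0, 4000)]
candidate = [(0.5, 90), (1.0, 110), (1.5, 140), (2.0, 170), (2.5, 195), (3.0, 650)]
gain = capacity_rps(candidate, 200) / capacity_rps(baseline, 200)
print(f"capacity gain at a 200 ms P99 TBT SLO: {gain:.1f}x")
```

The capacity ratios quoted in this section are ratios of this kind, taken at a fixed SLO.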

4.3 The latency vs throughput tradeoff curves

Figure 8 and Figure 9 plot RPS vs P99 TBT. The two key shape changes from vLLM to Sarathi-Serve are:

The knee of the curve shifts right (capacity grows). Sarathi-Serve sustains 1.5–4× more RPS at the same TBT.
The slope after the knee is gentler. Past saturation, vLLM’s TBT explodes (queueing); Sarathi-Serve degrades more gracefully because per-iteration variance is bounded.

Both shape changes matter. The first is the eye-catching “we’re faster” claim. The second is what you actually want under bursty load — overshoot capacity by 10% and a vLLM cluster falls off a cliff while a Sarathi-Serve cluster slows down by 20%.

4.4 Where the gains come from (ablation)

Section 5.4 ablates the two ideas:

Stall-free batching alone (no prefill chunking, just merge full prefills with decodes) yields about half the latency gain — the stall is the dominant tail effect.
Chunked prefills alone (chunk prefills but don’t merge with decodes) yields the throughput gain on the single-node setup but not the PP bubble fix.
Both together are super-additive on PP setups because uniform iterations are what eliminates bubbles.

4.5 Sensitivity to chunk size and budget

Figure 11 sweeps $C$ for Yi-34B. Below $C = 128$, attention overhead kills throughput. Above $C = 2048$, TBT inflates. The flat region in between (128 to 2048, a 16× range on the chunk-size axis) is wide enough that picking a chunk size is not delicate in practice.

4.6 What I would have wanted to see

Three blind spots, in declining order of importance:

Prefill-only and decode-only end-points. The paper does not formally evaluate the case where prefill volume is near zero (a saturated, decode-dominated server) or near 100% (a benchmark that does prefill-only). The two boundary regimes are where Sarathi-Serve should not help, and I would have liked to see the no-harm story made explicit.
Long-context regimes with KV-cache-bound prefill. When $L_p \ge 32\text{K}$, the prefill is no longer linear-layer-bound: it becomes attention-bound (quadratic in $L_p$), and chunked-prefill’s attention overhead grows. The paper sidesteps long-context by using traces with $L_p \le 13\text{K}$.
Multi-tenant fairness. Stall-free batching is great for the average user but might be unfair — a long-prompt user can monopolize multiple consecutive iterations’ prefill budget while short-prompt users wait. The paper does not separately measure per-user fairness.

None of these are fatal. They are simply the natural next questions.

5. Why this paper matters

5.1 It made a particular pattern canonical

vLLM’s default scheduler in 2024-Q2 picked up chunked-prefills directly. TensorRT-LLM added a --enable_chunked_context mode. SGLang’s RadixAttention scheduler implements a very similar token budget. NVIDIA’s internal Triton-LLM serving examples all use chunked prefills. The paper effectively standardized this pattern in the production stack.

5.2 It articulated the TTFT/TBT tradeoff precisely

Pre-Sarathi-Serve, papers tended to report a single “latency” number, often the average end-to-end latency. Sarathi-Serve makes the two-metric framing concrete: TTFT measures admission, TBT measures progression, and the tail TBT is the SLO operators actually care about. Almost every LLM serving paper since (DistServe, Splitwise, KV-Fold, SDLatencyModel, PipeSD) inherits this language.

5.3 It linked scheduling to arithmetic intensity

It would have been possible to argue for chunked-prefill purely on stall-elimination grounds. The paper’s stronger move is to argue for it on arithmetic-intensity grounds: the optimal token budget is the linear-layer compute/memory crossover, and that is independent of the workload. The SLO and the throughput optimum happen to coincide because the GPU’s linear-layer knee is what bounds both. That this insight required only Figure 5 to communicate is — to me — the paper’s biggest aesthetic achievement.

5.4 The choice to not innovate elsewhere

There is no learned scheduler, no RL, no transformer-based admission controller, no MoE, no quantization, no fancy mathematical bound. The paper deliberately does only what is necessary. Compared to the 2024-2026 trend of stacking five mechanisms in one paper, Sarathi-Serve’s restraint is refreshing.

6. Comparison to other LLM-serving work I’ve reviewed on this blog

6.1 DistServe (OSDI 2024)

DistServe disaggregates across machines: a dedicated prefill server is connected to a dedicated decode server, with KV state transferred between them. Sarathi-Serve interleaves the two phases in the same iteration on the same machine. The papers were submitted to OSDI 2024 in the same cycle and represent two genuinely orthogonal solutions to the same throughput–latency problem.

Disaggregation (DistServe, Splitwise). Strong when prefill and decode have very different hardware sweet spots, when KV-cache transfer is cheap, and when each phase can monopolize its own GPU. Best for cluster-scale serving.
Co-batching (Sarathi-Serve). Strong when a single-node deployment dominates, when KV transfer is too expensive (small clusters, no NVLink between machines), and when chunk sizing is feasible. Best for in-rack and consumer-scale serving.

Today most serving stacks use both — chunked-prefill within a stage and phase disaggregation across stages — and the two compose cleanly.

6.2 SDLatencyModel and PipeSD (my reviews on 2026-05-16 and 2026-05-17)

The latency-modeling work treats Sarathi-Serve as the black-box scheduler whose behavior you fit a queueing model to. PipeSD does cloud-edge collaborative speculative decoding (SD) and uses chunked-prefill-style scheduling for the verifier. The newer literature now assumes Sarathi-Serve as the default scheduler in much the same way physics papers assume Newton’s laws.

6.3 KV-Fold (my review on 2026-05-13)

KV-Fold attacks the memory cost of decoding; Sarathi-Serve attacks the latency cost. The two are compatible: KV-Fold sits underneath the engine, orthogonal to the scheduler.

6.4 vLLM’s PagedAttention (Kwon et al., SOSP 2023)

PagedAttention is about KV-cache layout. Sarathi-Serve is about iteration scheduling. They compose: modern vLLM is “PagedAttention KV + Sarathi-Serve scheduler.” The combination is roughly what defines “fast LLM serving” today.

7. Limitations and open questions

I want to be careful to state these so they read as “things to think about next” not as “fatal flaws”:

7.1 Attention overhead grows with context

For very long prompts (32K+), chunked-prefill’s attention cost grows quadratically with the cumulative chunked prefix and becomes a non-trivial fraction of the iteration. The paper measures the overhead at <3% for prompts up to ~8K, but at 64K, especially with full attention rather than flash-attention v2, the overhead can climb to 10–15%. A future variant could use a sparser attention pattern across chunks or schedule fewer, larger chunks for long-context regimes.

7.2 Token budget is profiled offline

The token budget $\tau$ is calibrated by an offline sweep. If your workload mix shifts (e.g., from chatbot to RAG with summarization), the calibration may drift. A small online controller (PID, EMA, bandit) over $\tau$ is the obvious extension. Some Sarathi-Serve descendants have implemented this; the paper itself leaves it to future work.

7.3 Fairness under skewed prompts

A pathological workload with one very long prompt arriving among many short ones can starve the short prompts’ first tokens (because the long prompt’s chunks keep consuming the budget). The paper does not investigate this. A weighted budget split or fair-queueing-style admission could solve it; some open-source implementations add this.

7.4 No accounting for speculative decoding

Sarathi-Serve precedes the production explosion of speculative decoding. When a decode step actually emits $E + 1 > 1$ tokens (via SD), the per-decode-step token count is higher and the calibration of $\tau$ should change. The paper does not address SD; later work has integrated chunked-prefill with SD with some care.

7.5 Multi-modal prefill

For multi-modal LLMs (vision-language, audio-language), the “prefill” is no longer linear in prompt token count, because image tokens dominate. Chunked-prefill still applies, but the chunk-size sweep has to account for vision-encoder cost as well. The paper is text-only; multi-modal extensions are an active area.

7.6 Worked example: a single iteration on Yi-34B

To make all of the above more concrete, let me walk through one Sarathi-Serve iteration as it would actually run. Imagine a Yi-34B deployment on TP2 (two A100s with NVLink). The chunk size $C = 1024$ and token budget $\tau = 2048$ are calibrated to a TBT SLO of 150 ms. At time $t$ , the scheduler state is:

6 requests in active_decodes (each contributes 1 token). $B_d = 6$ .
prefill_queue head: a request $R_A$ with an $L_p = 4500$ token prompt that has already completed 1 chunk (so 1024 tokens already in KV, 3476 tokens remaining).
prefill_queue next: a fresh request $R_B$ with $L_p = 800$ .

The scheduler computes the available budget for prefill chunks as $\tau - B_d = 2048 - 6 = 2042$. It then admits one chunk of $R_A$ (1024 tokens, since $R_A$ is mid-prefill), leaving 1018 budget. It then admits $R_B$’s first chunk, and because $R_B$’s prompt is only 800 tokens, it fits entirely. Budget remaining: 218. No more prefills are queued, so the iteration is sealed.

The forward pass thus sees:

6 decode tokens (1 per active request),
1024 prefill tokens (from $R_A$ , attending over its already-cached 1024 + the new 1024 = 2048 KV positions),
800 prefill tokens (from $R_B$ , attending over the new 800 positions plus prior cache, which is empty for $R_B$ ).

Total tokens: $6 + 1024 + 800 = 1830$ . The linear-layer cost is calibrated against the budget $\tau = 2048$ , so this iteration finishes in roughly 130 ms — comfortably inside the 150 ms SLO. The 6 ongoing users see exactly one decode step’s worth of latency. Crucially, $R_B$ has been admitted, and $R_A$ has progressed, and no decode has stalled.

After the iteration: $R_B$ is now in active_decodes (its prefill is done), $R_A$ has 2452 tokens of prefill left (two more 1024-token chunks and a final partial chunk of 404 tokens), and the 6 original decodes each got one new token. The next iteration repeats.

The arithmetic gives intuition for why this works as well as it does: every iteration “spends” its token budget on a mix of activity that contributes to either throughput (prefill chunks moving forward) or progress (decode tokens emitted). Nothing is wasted on a stall. Nothing overshoots the SLO. Nothing leaves the GPU idle. That triple constraint is exactly what the prior schedulers couldn’t satisfy simultaneously.
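The budget arithmetic of this worked example can be replayed in a few lines. The sketch assumes, as the example does, that each queued prompt gets at most one chunk of up to $C$ tokens per iteration; the variable names are mine.

```python
tau, chunk = 2048, 1024
num_decodes = 6
budget = tau - num_decodes                   # tokens left for prefill chunks

r_a_remaining = 4500 - 1024                  # R_A already prefilled one chunk
take_a = min(chunk, r_a_remaining, budget)
budget -= take_a

take_b = min(chunk, 800, budget)             # R_B's whole prompt fits in one chunk
budget -= take_b

batch_tokens = num_decodes + take_a + take_b
r_a_left = r_a_remaining - take_a
# Chunks R_A still needs if later iterations also grant it one chunk each.
future_chunks = [min(chunk, r_a_left - start) for start in range(0, r_a_left, chunk)]

print("unused budget:", budget)
print("tokens in this iteration:", batch_tokens)
print("R_A tokens left:", r_a_left, "as chunks", future_chunks)
```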

7.7 What changes for Hopper-class GPUs

A natural question is whether anything changes when we move from A100 (Ampere) to H100 (Hopper). The H100 increases peak FP16 FLOPs by ~3× over A100 but only ~1.5× HBM bandwidth, which shifts the linear-layer arithmetic-intensity knee to the right: the knee moves from ~256 tokens (A100) to ~512 tokens (H100). FP8 support shifts it further still — to ~1024 tokens — because FP8 doubles effective FLOPs per byte of activation.
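The direction and rough size of the FP16 shift follow from the roofline model in Section 1.6. Setting $T_{\mathrm{math}} = T_{\mathrm{mem}}$ and dropping the activation term gives a knee of about $\mathrm{FLOPS}_\text{peak} \cdot \mathrm{bytes}_\text{w} / (2 \cdot \mathrm{BW}_\text{peak})$ tokens. The sketch below plugs in datasheet peaks that I am assuming (dense FP16 tensor throughput and HBM bandwidth for the SXM parts). The absolute values are idealized and come out lower than the knees quoted above, but the ratio between the two GPUs is what matters here.

```python
def ideal_knee_tokens(flops_peak, bw_peak, bytes_w=2):
    """Weight-dominated roofline knee: tokens at which matmul math time
    equals the time to load the weights once."""
    return flops_peak * bytes_w / (2 * bw_peak)


# Assumed datasheet peaks: (dense FP16 FLOP/s, HBM bytes/s).
gpus = {
    "A100 80GB SXM": (312e12, 2.039e12),
    "H100 SXM": (989e12, 3.35e12),
}
knees = {name: ideal_knee_tokens(flops, bw) for name, (flops, bw) in gpus.items()}
for name, knee in knees.items():
    print(f"{name}: ideal knee about {knee:.0f} tokens")
print(f"shift: {knees['H100 SXM'] / knees['A100 80GB SXM']:.2f}x")
```

Under these assumptions the knee moves right by a little under 2×, in line with the doubling described above.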

For Sarathi-Serve, this means $\tau$ should be larger on H100 than on A100. Empirical reports from production deployments (vLLM v0.5+, TensorRT-LLM 0.10+) confirm this: A100 $\tau$ values in the 1500–2000 range translate to H100 $\tau$ values in the 3000–4000 range for the same TBT SLO. The scheduler algorithm itself is unchanged — only the offline profile shifts. This is the “robust to hardware” property the paper claims.

The Blackwell generation (B200, 2025) pushes the knee further again, and FP4 (introduced in B200) pushes it even more. The Sarathi-Serve pattern scales as long as the offline profile is re-run; the design space and algorithm survive.

8. Reproducibility notes

The Sarathi-Serve source code is open at https://github.com/microsoft/sarathi-serve and is in active use by Microsoft Research and Azure ML. Reproducing the paper end-to-end requires:

A100 or A40 GPUs (Hopper architecture also works, with profiled $\tau$ ).
The traces are public (openchat_sharegpt4, arxiv_summarization).
The Falcon-180B run requires 8 GPUs across 2 nodes. Smaller-budget reproductions (Mistral-7B on a single A100) reproduce the core 2.6× headline cleanly.

Subsequent open-source implementations in vLLM (--enable-chunked-prefill), TensorRT-LLM, and SGLang make it easy to reproduce the technique without running the original codebase. For research that wants to compare against the exact Sarathi-Serve scheduler, the original repo is still the reference.

9. Personal takeaways

A few things I will take with me from this read:

The right metric to optimize is tail latency under capacity, not average latency. Sarathi-Serve’s experiments are dispositive on this. A scheduler with great average latency can have a P99 that is 10–50× worse, which is what users actually feel.
Match the batch size to the hardware knee, not to memory. The PagedAttention generation maxed batch size against KV-cache. Sarathi-Serve maxes token count against the linear-layer knee. The second criterion is the one that scales.
Uniform iterations are the secret to pipeline parallelism. I had previously thought PP bubbles were a fundamental cost. The chunked-prefill trick also generalizes: any system that wants to use PP for inference should aim for near-uniform-shape micro-batches first.
Simple, necessary mechanisms beat compound, sufficient ones. Two ideas, two parameters, three figures of speedup. That ratio of explanatory ideas to outcomes is what I aspire to in my own systems papers.

10. Verdict and where to read next

Sarathi-Serve is, in my view, the most important LLM-serving systems paper of 2024. It is the paper I would assign first to a new graduate student joining an LLM serving group. The technique is now table stakes in every production stack, and the framing — chunk + budget + uniform iteration — generalizes well beyond LLMs.

Suggested reading order if you find this interesting:

Sarathi-Serve (this paper). The scheduler core.
vLLM / PagedAttention (Kwon et al., SOSP 2023). The KV-cache substrate.
DistServe (Zhong et al., OSDI 2024). The disaggregation alternative.
Splitwise (Patel et al., ISCA 2024). The same disaggregation idea from the Microsoft side.
KV-Fold (Wang et al., 2026). The orthogonal memory-cost angle.
SDLatencyModel (Kong et al., 2026). A descriptive queueing model that treats Sarathi-Serve as the default.

After all that, you have the full mental model of how a modern LLM serving stack actually batches tokens. From there, the next frontier is hybrid SD + chunked-prefill + disaggregation, which is where serving infrastructure is heading in 2026 and which several of my recent reviews have started to map.

This review was written for an audience comfortable with transformer inference and basic GPU performance modeling, but not necessarily familiar with the LLM serving stack. If you have feedback or corrections, please reach out.