Review date: 2026-05-28 Review author: Zhongzhu Zhou Paper reviewed: Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving Paper authors: Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu arXiv: 2407.00079 Status: FAST 2025 Best Paper Award; ACM Transactions on Storage
Short Answer
Mooncake is Kimi’s production LLM serving platform, built around a single insight: the KV cache is the central scarce resource in LLM serving, so the entire system architecture should be organized around managing, reusing, and moving it efficiently. The paper proposes disaggregating the prefill cluster from the decoding cluster, maintains a distributed KV cache pool in CPU DRAM and SSD across the entire fleet, and uses a global “Conductor” scheduler that makes every scheduling decision in terms of KV cache placement and transfer cost. On real Kimi workloads, this architecture handles 75% more requests than a vLLM baseline at the same SLO thresholds, and achieves up to 525% throughput improvement in simulated long-context scenarios.
Prerequisites
This review assumes familiarity with the basics of transformer inference. Here is a compact recap of everything you will need before the paper’s innovations make sense.
1. Transformer Autoregressive Generation
A decoder-only language model like GPT generates text token-by-token. Given an input prompt of tokens, the model first runs a prefill (also called initiation) pass: it processes all prompt tokens in parallel and caches their computed key-value tensors. Then, in the decoding (also called increment) phase, the model generates one new token per forward pass, each time attending over all previous tokens. Because all previous key and value tensors are cached (the KV cache), each decode step only computes attention for a single new query vector.
2. KV Cache Structure
In each transformer layer, every token produces a key vector and a value vector . For a model with layers, attention heads of dimension , and half-precision (FP16) storage, the KV cache for a single token takes:
For LLaMA2-70B (, , , GQA with 8 KV heads):
A 128k-token context therefore requires about 40 GB of KV cache — half the VRAM of an A100-80GB.
3. Prefill Compute Complexity
In the prefill pass, computing self-attention for prompt tokens has complexity:
- Attention: — quadratic in sequence length
- MLP / FFN: — linear in sequence length
This quadratic scaling means that doubling the input length quadruples the attention compute in the prefill pass. For very long contexts (16k–128k tokens), prefill dominates GPU time and blocks the decoding cluster.
4. Service-Level Objectives (SLOs)
Production LLM services define two key latency metrics:
- TTFT (Time to First Token): Time from request arrival to the first generated token. This is dominated by the prefill phase.
- TBT (Time Between Tokens): Time between consecutive generated tokens during decoding. This is dominated by memory bandwidth (reading the full KV cache at each decode step).
These two objectives are naturally in tension: batching many requests together improves throughput but increases individual request latency.
5. Continuous Batching (Orca, 2022)
Before Mooncake, the state of the art was Orca’s iteration-level scheduling: instead of running a fixed batch of requests from start to finish, the scheduler picks requests at every generation step, letting finished requests exit and new requests enter dynamically. This eliminated the “batch waits for the slowest request” bottleneck and dramatically improved GPU utilization.
However, Orca couples prefill and decoding in the same cluster — a long prefill for a new request can stall ongoing decoding iterations, degrading TBT latency. Mooncake removes this coupling.
6. Disaggregated Prefill–Decode (DistServe, 2024)
DistServe (reviewed separately) showed that physically separating prefill and decoding into different GPU pools is beneficial: the two phases have very different hardware requirements (compute-intensive vs. memory-bandwidth-intensive), and separating them prevents interference. Mooncake builds directly on this idea, but adds the KV cache as a first-class architectural component.
7. PagedAttention and KV Cache Management
vLLM introduced PagedAttention, which stores KV caches in non-contiguous, fixed-size “pages” (blocks) rather than requiring contiguous GPU memory. This eliminates internal fragmentation and allows fine-grained memory sharing. Mooncake extends this idea to a distributed, hierarchical (GPU VRAM → CPU DRAM → SSD) KV cache pool spanning the entire cluster.
8. RDMA Networking
RDMA (Remote Direct Memory Access) allows one node to read from or write to another node’s memory without involving the remote CPU. RDMA NIC hardware drives the transfer directly, at near-network-wire-speed with very low CPU overhead. Modern GPU clusters use RDMA fabrics (InfiniBand or RoCE) between nodes; within a node, NVLink connects GPUs. Mooncake’s Transfer Engine exploits GPUDirect RDMA: the GPU memory is directly exposed to the RDMA fabric, so KV tensors can move GPU-to-GPU across nodes without a CPU bounce copy.
Why Existing Systems Fall Short
The Coupled Prefill–Decode Problem
gantt
title Coupled Prefill-Decode (vLLM / Orca style)
dateFormat X
axisFormat %s
section Request A (long input)
Prefill (64k tokens) :active, a1, 0, 120
Decode step 1 :a2, 120, 130
Decode step 2 :a3, 130, 140
section Request B (arrives at t=50)
Waiting for prefill to finish :crit, b1, 50, 120
Prefill (short) :b2, 120, 130
Decode step 1 :b3, 130, 140
section Request C (decode-only)
Decode step 1 (stalled) :crit, c1, 50, 120
Decode step 2 :c2, 120, 130
Decode step 3 :c3, 130, 140
Figure 1 – Coupled interference. When a long-input request starts its prefill (120 time-units), all decode-only requests (Request C) and new arrivals (Request B) are blocked. Request C’s TBT SLO may be violated even though it needs no prefill at all.
In a production system with diverse context lengths (some requests have 128k-token prompts, others have 200-token prompts), this interference is severe. The paper reports that on the real Kimi workload, vLLM achieves only 57% TBT SLO compliance when long-context requests are in flight — 43% of decode intervals miss their latency target.
The Memory Reservation Problem
Orca-style schedulers reserve KV cache GPU memory at request admission time based on max_tokens. If a request is budgeted 2,048 output tokens but finishes in 30, those 2,017 token slots of VRAM sit idle for the duration. This over-reservation leads to low effective VRAM utilization and conservative admission control, further limiting throughput.
The Reuse Opportunity
A critical insight from production data: many requests share common prefixes. System prompts, few-shot examples, and popular document prefixes repeat across thousands of user requests. If the KV cache for a common prefix were stored once and reused, the compute for those tokens would be entirely avoided. vLLM’s RadixAttention / prefix caching implements a local version of this idea, but it only reuses KV cache within a single serving instance. Mooncake extends prefix caching across the entire cluster.
Mooncake Architecture Overview
graph TB
subgraph Client
R[Request Stream]
end
subgraph Conductor["Conductor (Global Scheduler)"]
C[Scheduler Logic<br/>TTFT Estimation<br/>SLO Validation<br/>Early Rejection]
end
subgraph PrefillPool["Prefill Pool (compute-optimized)"]
P1[Prefill Instance 1]
P2[Prefill Instance 2]
P3[Prefill Instance N]
end
subgraph DecodingPool["Decoding Pool (memory-optimized)"]
D1[Decoding Instance 1]
D2[Decoding Instance 2]
D3[Decoding Instance M]
end
subgraph KVStore["Distributed KV Cache Pool"]
GPU[GPU VRAM<br/>active requests]
DRAM[CPU DRAM<br/>cached blocks]
SSD[SSD<br/>cold blocks]
end
subgraph Messenger["Transfer Engine (Messenger)"]
RDMA[GPUDirect RDMA<br/>800 Gbps]
end
R -->|route| C
C -->|assign prefill| PrefillPool
C -->|assign decode| DecodingPool
PrefillPool <-->|KV read/write| DRAM
DecodingPool <-->|KV read| DRAM
DRAM <-->|eviction| SSD
PrefillPool <-->|transfer| RDMA
RDMA <-->|transfer| DecodingPool
PrefillPool <-->|local| GPU
DecodingPool <-->|local| GPU
Figure 2 – Mooncake system architecture. The Conductor is the global brain; the Prefill Pool and Decoding Pool are physically separate GPU clusters; the distributed KV cache pool spans CPU DRAM and SSD; the Transfer Engine (Messenger) moves KV blocks across machines via RDMA.
The KV Cache as First-Class Citizen
The fundamental shift Mooncake makes is conceptual: instead of treating KV cache as a side-effect of serving requests, it makes KV cache the primary resource around which scheduling, placement, and data movement are organized.
Every scheduled request follows this pipeline:
sequenceDiagram
participant Client
participant Conductor
participant KVPool as KV Cache Pool (CPU DRAM)
participant Prefill as Prefill Instance
participant Messenger as Transfer Engine
participant Decode as Decoding Instance
Client->>Conductor: New request (prompt_tokens, max_tokens, SLO)
Conductor->>KVPool: Prefix hash lookup (find cached prefix blocks)
KVPool-->>Conductor: cache_hit_len (# tokens already cached)
Conductor->>Conductor: Estimate TTFT = T_queue + T_prefill(L - cache_hit_len) + T_transfer
Conductor->>Conductor: SLO check (TTFT < TTFT_SLO? TBT feasible?)
alt SLO feasible
Conductor->>Prefill: Assign request + cached prefix info
Prefill->>KVPool: Load prefix KV blocks into GPU VRAM (async)
Prefill->>Prefill: Compute prefill for remaining tokens (incremental)
Prefill->>KVPool: Store new KV blocks (hash, layer-by-layer)
Prefill->>Messenger: Stream KV blocks to decoding node
Messenger->>Decode: Transfer KV blocks via RDMA
Decode->>Decode: Begin autoregressive decoding
Decode->>Client: Stream generated tokens
else SLO infeasible
Conductor->>Client: Early reject
end
Figure 3 – Request lifecycle in Mooncake. The critical path from request arrival to first token includes a prefix cache lookup, incremental prefill (skipping already-cached tokens), KV transfer, and decoding.
KV Cache Storage and Hashing
KV cache blocks are stored as paged units — fixed-size blocks (e.g., 16 tokens worth of KV tensors per block) — across the hierarchy. Each block is tagged with a hash key derived from its token content:
This content-addressed scheme enables deduplication: if two requests share a common prefix, their prefix blocks have identical hashes and can be identified as the same data without comparing the raw tensors. The Conductor uses these hashes for two purposes:
- Prefix matching at scheduling time: Find which prefill instance already has the most blocks matching this request’s prefix.
- Cache eviction / replication: Track which blocks are hot (many hits) and replicate them; evict cold blocks to SSD.
Transfer Engine: Moving KV Across the Cluster
The Transfer Engine (called “Messenger” in the paper) is the component that makes KV cache disaggregation practical. Without fast KV transfer, the cost of sending computed KV tensors from a prefill node to a decoding node would exceed the savings from prefix caching and disaggregation.
Hardware: GPUDirect RDMA
GPUDirect RDMA allows a GPU’s HBM to be directly exposed to the RDMA fabric. When a prefill node sends a KV block to a decoding node, the RDMA NIC reads directly from GPU HBM without involving the host CPU or PCIe. Key characteristics of the testbed:
- 800 Gbps RDMA between nodes (InfiniBand HDR or RoCE v2)
- NVLink within-node (up to 900 GB/s aggregate within an 8-GPU node)
- 8 × A800-SXM4-80GB GPUs per node (80 GB HBM each)
Transfer time for a KV block of size bytes is approximately:
For a LLaMA2-70B 16k-token context: , (800 Gbps), so . This is acceptable for a TTFT SLO of 30 seconds.
Layer-wise Streaming and Overlap
A naive approach would complete the entire prefill, then transfer all KV blocks to the decoding node, then start decoding. This wastes time: decoding cannot begin until all transfers finish. Mooncake uses layer-wise async streaming:
gantt
title Layer-wise Prefill KV Transfer (Overlap)
dateFormat X
axisFormat %s
section Prefill Compute
Layer 1-20 compute :a1, 0, 20
Layer 21-40 compute :a2, 20, 40
Layer 41-60 compute :a3, 40, 60
Layer 61-80 compute :a4, 60, 80
section KV Transfer (async)
Transfer layers 1-20 KV :b1, 20, 35
Transfer layers 21-40 KV :b2, 40, 55
Transfer layers 41-60 KV :b3, 60, 75
Transfer layers 61-80 KV :b4, 80, 95
section Decoding (starts early)
Wait for all layers :crit, c1, 0, 95
Decode step 1 :c2, 95, 100
Figure 4 – Layer-wise KV transfer overlap. Prefill computes each group of layers, then immediately starts streaming those KV tensors to the decoding node. The total transfer completes shortly after prefill finishes, reducing end-to-end TTFT compared to a monolithic (compute-all then transfer-all) approach.
The paper shows that for 128k-token contexts, layer-wise overlap reduces KV store latency by 30–60% compared to naive bulk transfer.
A second benefit of layer-wise prefill is VRAM independence: because KV tensors are streamed out as they are computed, the prefill GPU does not need to hold all 80 layers’ worth of KV in VRAM simultaneously. It only needs to buffer one layer group at a time. This means prefill can handle arbitrarily long contexts as long as VRAM can hold a single layer group — a major scalability improvement.
The Conductor: Global KV-Cache-centric Scheduler
The Conductor is the central brain of Mooncake. It runs as a separate service and makes all request routing decisions. Its core algorithm is reproduced and expanded below.
Inputs and State
- : the set of prefill instances
- : the set of decoding instances
- : an incoming request with fields:
prompt_tokens: token IDs of the input promptmax_tokens: upper bound on output lengthTTFT_SLO,TBT_SLO: latency requirements
cache_block_size: number of tokens per KV block
Step-by-Step Algorithm
Algorithm: KVCache-Centric Scheduling (Conductor)
Input: P (prefill pool), D (decode pool), R (request), B (block size)
Output: (p*, d*) = chosen (prefill_instance, decoding_instance), or REJECT
1. block_hashes ← PrefixHash(R.prompt_tokens, B)
// Generate the sequence of block hash keys for R's prompt.
// block_hashes[i] is the hash of the i-th KV block.
2. best_TTFT ← ∞
best_p ← None
best_cache_len ← 0
3. FOR each prefill instance p ∈ P:
4. // Determine how many prefix blocks this instance already holds
5. cache_len[p] ← PrefixMatch(p.cached_blocks, block_hashes)
// cache_len[p] = number of tokens already cached at instance p
6. T_queue[p] ← EstimatePrefillQueueTime(p)
// Based on current queue depth and estimated per-token prefill speed
7. T_prefill[p] ← EstimatePrefillExecTime(len(R.prompt) - cache_len[p])
// Prefill exec time is linear in # uncached tokens (ignoring quadratic
// attention for simplicity; actual model accounts for full attention)
8. IF cache_len[p] < kvcache_balancing_threshold × len(R.prompt):
// Not enough local cache to justify loading from here;
// look for a remote source to transfer from
9. FOR each other instance p' with cache_len[p'] > cache_len[p]:
10. T_transfer ← EstimateKVTransferTime(p, p', cache_len[p'])
11. T_candidate ← T_queue[p] + T_prefill_with_remote_cache + T_transfer
12. IF T_candidate < best_TTFT:
13. best_TTFT ← T_candidate
14. best_p ← p
15. best_cache_source ← p'
16. END FOR
17. ELSE:
18. T_candidate ← T_queue[p] + T_prefill[p]
19. IF T_candidate < best_TTFT:
20. best_TTFT ← T_candidate
21. best_p ← p
22. best_cache_source ← p // use local cache
23. END IF
24. END FOR
25. d* ← SelectDecodingInstance(D) // load-balanced selection
26. // SLO gate
27. IF best_TTFT > R.TTFT_SLO OR EstimateTBT(d*) > R.TBT_SLO:
28. RETURN REJECT
29. // Hot-spot mitigation: proactively replicate if cache is heavily skewed
30. IF best_cache_len[best_p] / len(R.prompt) > hot_threshold:
31. TriggerCacheReplication(best_cache_source → best_p)
32. RETURN (best_p, d*)
Design Choices and Why They Matter
Why balance cache hit rate against load? A naïve cache-aware scheduler would always route to the instance with the most matching prefix blocks, causing hot-spot concentration. Popular system prompts would make a few instances overwhelmed while others sit idle. The kvcache_balancing_threshold parameter controls this tradeoff: if the local cache hit would save less than a threshold fraction of prefill compute, the scheduler considers load-balancing alternatives.
Why estimate TTFT rather than just throughput? Pure throughput maximization ignores tail latency. A scheduler that packs as many tokens as possible into each GPU step will frequently violate TTFT SLOs for new arrivals. By estimating TTFT at scheduling time and comparing it against the SLO, the Conductor can reject requests early (before wasting GPU cycles on their prefill) when the system is overloaded.
Alternative — why not use a token-based priority queue? A simpler design (used in early vLLM) is to admit requests in FIFO order with a maximum batch size. This works well for uniform workloads but fails under long-context diversity: a single 128k-token request blocks thousands of short requests from starting. The Conductor’s per-request TTFT estimation allows heterogeneous workloads to coexist.
Chunked Pipeline Parallelism (CPP) for Multi-Node Prefill
For very long prompts (e.g., 128k tokens), a single GPU node may not have enough VRAM or compute to prefill in an acceptable time. Mooncake introduces Chunked Pipeline Parallelism (CPP), a novel parallelism strategy for the prefill phase.
graph LR
subgraph Input
Prompt["128k token prompt<br/>(split into chunks of ≥1000 tokens)"]
end
subgraph PipelineGroup["Pipeline Group: 3 prefill nodes"]
N1["Node 1<br/>Layers 1-26<br/>chunk 1"]
N2["Node 2<br/>Layers 27-53<br/>chunk 1"]
N3["Node 3<br/>Layers 54-80<br/>chunk 1"]
end
subgraph Time
T1["t=0: chunk1 → Node1<br/>t=1: chunk1 → Node2, chunk2 → Node1<br/>t=2: chunk1 → Node3, chunk2 → Node2, chunk3 → Node1<br/>..."]
end
Prompt --> N1
N1 -->|"hidden states<br/>(inter-layer comm)"| N2
N2 -->|"hidden states"| N3
Prompt --> T1
Figure 5 – Chunked Pipeline Parallelism. The prompt is divided into chunks; each chunk flows through a pipeline of nodes (each node handles a subset of transformer layers). While chunk 1 is at Node 2 (processing layers 27-53), chunk 2 begins at Node 1. This keeps all nodes busy and amortizes inter-node communication over many chunks.
CPP vs. Alternatives
| Design | Communication pattern | Drawback |
|---|---|---|
| Tensor Parallelism (all-reduce) | All-reduce at every attention layer across all nodes | Very high comm frequency; latency ∝ #nodes |
| Sequence Parallelism | Ring-based all-gather + reduce at each layer | Complex topology, sensitive to partition changes |
| CPP (Mooncake) | Hidden state transfer only at pipeline boundaries | Low freq; can overlap with KV streaming |
CPP’s key advantage is that cross-node communication only happens when hidden states cross pipeline boundaries (once per chunk per stage), not at every attention head. This makes communication overlappable with computation: while layer runs on Node 2, Node 1 streams its hidden states to Node 2 for the next chunk. The paper notes this is the first application of pipeline parallelism to the LLM inference (prefill) stage, as prior work used pipeline parallelism only for training.
Why CPP instead of Chunked Prefill + Continuous Batching?
Sarathi-Serve (2024) showed that “chunked prefill” (splitting a long prefill into sub-chunks that interleave with decode steps) reduces TTFT–TBT interference. Mooncake’s authors argue CPP is preferable for very long contexts because:
- Chunked prefill still runs prefill and decoding on the same GPU cluster, so they still compete for VRAM.
- CPP scales prefill capacity horizontally across nodes without touching the decoding cluster.
- The pipeline stages allow layer-by-layer KV streaming (a natural fit).
The boundary condition where CPP underperforms is short prompts: if the prompt is only 200 tokens, the overhead of spinning up a multi-node pipeline exceeds any parallelism benefit. CPP is specifically designed for contexts where a few thousand tokens.
Early Rejection Policy
A subtle but important engineering detail is the early rejection policy. Without it, the following bad scenario occurs repeatedly:
- Conductor schedules request to a prefill instance (based on estimated decoding load).
- Prefill takes 5 seconds for ‘s 128k-token prompt.
- By the time prefill finishes, the decoding instances are overloaded with other requests.
- The decoding instance rejects because its TBT SLO is already infeasible.
- The 5 seconds of prefill compute was wasted entirely.
flowchart TD
A[New request arrives] --> B{Prefix cache lookup}
B --> C[Estimate TTFT_candidate]
C --> D{TTFT_candidate ≤ TTFT_SLO?}
D -- No --> E[Early reject: TTFT infeasible]
D -- Yes --> F{Estimate decoding TBT feasibility}
F -- Infeasible --> G[Early reject: TBT infeasible]
F -- Feasible --> H[Schedule to prefill instance]
H --> I[Run prefill]
I --> J{Decoding instance accepts?}
J -- No: load changed --> K[Retry or reject]
J -- Yes --> L[Transfer KV + start decoding]
L --> M[Return tokens to client]
Figure 6 – Early rejection decision flow. By checking both TTFT and TBT feasibility before scheduling, the Conductor prevents wasted prefill compute and protects decoding SLOs.
Prediction-Based Early Rejection
The basic early rejection checks the current state of the decoding instance. A more sophisticated version — the prediction-based variant — accounts for load fluctuation. The paper observes that prefill and decoding loads oscillate anti-phase: when many prefill tasks complete simultaneously, they flood decoding with new requests, causing a TBT spike.
The prediction algorithm works as follows:
Algorithm: System-Level Decoding Load Prediction
State: Set of in-flight decode requests, each with start_time and expected duration t_d
FOR each future time t in [now, now + T_horizon]:
// Simulate which requests will finish vs. which new prefills will arrive
active_at_t = {r ∈ in_flight | r.start_time ≤ t ≤ r.start_time + t_d}
arriving_at_t = {r ∈ prefill_queue | r.expected_prefill_done ≈ t}
total_at_t = |active_at_t| + |arriving_at_t|
avg_TBT_ratio[t] = mean over decoding instances d of:
(total_at_t / d.capacity) × baseline_TBT
IF max(avg_TBT_ratio) > TBT_SLO:
REJECT this request
ELSE:
ACCEPT
The evaluation (Table 2) shows that:
- Baseline (no early rejection): 4,183 requests rejected by the decoding instance after wasted prefill
- With early rejection: 3,771 rejected (9.9% reduction in wasted compute)
- With early rejection + prediction: 3,589 rejected (14.2% reduction)
The prediction gains an additional 4.3% reduction because it catches the “burst arrivals from simultaneous prefill completions” pattern, which pure state-snapshot rejection misses.
Architecture Deep-Dive: Why Disaggregate?
A reasonable question is: why not use the simpler chunked-prefill approach that vLLM and Sarathi-Serve adopt? The paper provides two concrete arguments:
graph LR
subgraph vLLM["Coupled Architecture (vLLM)"]
G1[GPU 1<br/>Prefill + Decode]
G2[GPU 2<br/>Prefill + Decode]
G3[GPU 3<br/>Prefill + Decode]
G4[GPU 4<br/>Prefill + Decode]
end
subgraph Mooncake["Disaggregated Architecture (Mooncake)"]
direction LR
P1[Prefill GPU 1]
P2[Prefill GPU 2]
D1[Decode GPU 3]
D2[Decode GPU 4]
KV[KV Pool<br/>CPU DRAM]
P1 <-->|write KV| KV
P2 <-->|write KV| KV
D1 <-->|read KV| KV
D2 <-->|read KV| KV
end
Figure 7 – Coupled vs. disaggregated architecture. In vLLM, every GPU runs both prefill and decode. In Mooncake, GPU resources are dedicated to one phase, with KV state flowing through a shared pool.
Argument 1: Cross-node parallelism for long prefill. In a coupled system, the maximum prefill parallelism for a single request is bounded by the tensor parallelism degree within one model replica. With disaggregation, Mooncake can group multiple prefill nodes into a CPP pipeline, allowing a single request’s prefill to span many more GPUs without touching the decode cluster.
Argument 2: Layer-wise KV streaming. In a coupled system, KV tensors computed during prefill must remain in GPU VRAM until decoding is done (they cannot be offloaded mid-request without expensive re-computation). In the disaggregated model, prefill nodes stream KV layer-by-layer to CPU DRAM as soon as each layer is computed, freeing GPU VRAM for the next chunk. This makes VRAM occupation during prefill essentially independent of context length.
Where Disaggregation Hurts
The main cost of disaggregation is the KV transfer latency added to TTFT. For short prompts (e.g., 200 tokens), the KV size is tiny (~64 MB for LLaMA2-70B), so transfer takes ~1 ms — negligible. But for very long prompts with no cache hits, transfer is an unavoidable TTFT addition. The system mitigates this by overlapping transfer with the tail of prefill computation (layer-wise streaming).
A second cost is system complexity: managing a distributed KV cache pool, tracking block hashes, implementing RDMA transfers, and predicting transfer times all add operational overhead. The paper acknowledges this, noting that the kvcache_balancing_threshold is “currently adjusted manually” — a limitation that suggests room for a more automated adaptive algorithm.
Experimental Evaluation
Setup
- Hardware: 8 × NVIDIA A800-SXM4-80GB per node; 800 Gbps RDMA between nodes; NVLink within-node.
- Model: LLaMA2-70B architecture (dummy weights — actual Kimi model weights are proprietary, so synthetic inference is used for measurement consistency).
- Baselines: vLLM with continuous batching (representing the coupled prefill–decode state of the art).
Public Dataset Evaluation
Two document-heavy datasets:
| Dataset | Avg Input Tokens | Avg Output Tokens | Prefix Cache Hit Rate | Config | Throughput vs. vLLM |
|---|---|---|---|---|---|
| ArXiv Summarization | 8,088 | 229 | ~0% (unique docs) | Mooncake [3P+1D] vs. vLLM [4M] | +20% |
| L-Eval | 19,019 | 72 | >80% (shared context) | Mooncake [3P+1D] vs. vLLM [4M] | +40% |
The [3P+1D] notation means 3 prefill instances + 1 decoding instance in Mooncake; [4M] means 4 monolithic (coupled) instances in vLLM, giving both systems the same total GPU budget.
On ArXiv (no cache reuse), Mooncake still gains 20% from disaggregation alone — decoding is no longer disrupted by prefill. On L-Eval (>80% cache reuse), the cache prefix hit eliminates up to 80% of prefill compute, explaining the larger 40% gain.
Simulated Long-Context Scaling
xychart-beta
title "Mooncake vs. vLLM throughput improvement (50% cache rate, 512 output tokens)"
x-axis ["16k", "32k", "64k", "128k"]
y-axis "Throughput improvement (%)" 0 --> 600
bar [50, 120, 280, 525]
Figure 8 – Throughput gain scales with context length. For 16k-token inputs Mooncake is 50% faster; at 128k tokens the gain reaches 525%. The reason: vLLM cannot batch long-context requests (one request consumes too much VRAM and blocks others), while Mooncake’s disaggregated KV pool enables batching even at 128k context.
The gain increases super-linearly with context length because:
- vLLM’s throughput drops with longer contexts (one large request blocks many small ones).
- Mooncake’s throughput holds steady due to separate prefill scaling and cross-cluster KV caching.
Real Kimi Production Workload
The most compelling evaluation uses 23,000 real Kimi user requests replayed at original timestamps. Setup: Mooncake [10P+10D] vs. vLLM [20M] (20 monolithic instances).
| Metric | Mooncake | vLLM |
|---|---|---|
| TTFT P90 SLO compliance | ~100% | ~100% |
| TBT P90 SLO compliance | 100% | 57% |
| Effective throughput (goodput) | 75% more requests | baseline |
The TBT result is striking: vLLM violates TBT SLO for 43% of decode intervals. This is because long-context prefill preempts ongoing decode batches, causing irregular token generation timing. Mooncake, with fully separate decoding instances, maintains smooth TBT regardless of concurrent prefill activity.
Limitations and Boundary Conditions
Every design has a regime where it works best and conditions where it struggles:
1. Short-context, low-cache-reuse workloads. When prompts are short (<500 tokens) and there is no prefix sharing, disaggregation adds KV transfer overhead with little benefit (small KV, no cache reuse). A coupled serving architecture would be simpler and comparably efficient for this case.
2. The cold-start cache problem. On first startup or after a cache eviction, there are no cached blocks, and every request must fully prefill. The system provides no optimistic caching during warm-up; performance gradually improves as the cache fills. This is the classic “cold cache” problem in prefetching/caching systems.
3. Manual threshold tuning. The kvcache_balancing_threshold and other scheduling hyperparameters are currently set manually. In a production environment with changing workload distributions, sub-optimal thresholds may lead to either hot-spot concentration (too cache-aware) or excessive cache misses (too load-balanced). The paper acknowledges this and flags adaptive tuning as future work.
4. Output length prediction. The early rejection algorithm assumes a uniform per-request decoding time . Real workloads have highly variable output lengths. Accurately predicting output length would enable much more precise TBT SLO compliance estimation, but the paper notes this as “challenging due to high costs or low accuracy” with current prediction models.
5. Fault tolerance. The paper does not discuss what happens when a prefill instance or decoding instance fails mid-request. The KV cache pool (CPU DRAM / SSD) is a potential single point of failure if not replicated.
Worked Example: Conductor Scheduling a Long-Context Request
To make the algorithm concrete, let us trace through a complete scheduling decision for a single request.
Setup: 4 prefill instances (P1-P4), 2 decoding instances (D1-D2), LLaMA2-70B (320 KB KV/token), RDMA BW = 100 GB/s.
Incoming request R:
prompt_tokens: 32,768 tokens (32k context)max_tokens: 512TTFT_SLO: 10 secondsTBT_SLO: 0.1 sec/token
Step 1 — Prefix hash lookup: The Conductor generates 2,048 block hashes (32k tokens / 16 tokens per block). It queries each prefill instance’s cached block set.
| Instance | Cached block matches | cache_len (tokens) |
|---|---|---|
| P1 | 1,200 blocks matched | 19,200 tokens |
| P2 | 600 blocks | 9,600 tokens |
| P3 | 0 blocks | 0 tokens |
| P4 | 800 blocks | 12,800 tokens |
Step 2 — TTFT estimation for each instance:
Assumptions:
- Queue wait = 0.5 sec for all (equal load)
- Prefill rate = 2,000 tokens/sec (compute-bound, includes quadratic attention)
- Transfer rate for a block = 16 tokens × 320 KB = 5 MB per block; at 100 GB/s → ~0.05 ms per block
✓ (< 10s SLO)
✗ (> 10s SLO)
✗ (> 10s SLO)
✗ (> 10s SLO)
Step 3 — Remote cache transfer option for P4 (has second-best cache): If P4 transfers the missing 19,200 - 12,800 = 6,400 tokens worth of blocks from P1:
- Transfer size =
- Transfer time =
✓
Step 4 — Select best:
- P1: 7.28s ✓ (best with its own cache)
- P4 + P1 cache: 7.32s ✓ (slightly worse due to transfer)
Conductor selects P1 (local cache, 7.28s TTFT, meets SLO).
Step 5 — Decoding instance: D1 load = 60%, D2 load = 30% → select D2 (lighter load, TBT SLO feasible).
Step 6 — SLO gate passes. Conductor assigns (P1, D2) for this request.
This example illustrates why the Conductor needs both cache-awareness (P1’s large cache avoids 60% of prefill) and load-awareness (overloaded P2 would miss the TTFT SLO despite having 9.6k cached tokens).
KV Cache Eviction, Replication, and the Storage Hierarchy
A distributed KV cache pool introduces classic distributed systems problems: what happens when DRAM fills up? How do you avoid hotspots? How do you decide when to replicate versus evict?
Three-Tier Hierarchy
graph TD
subgraph GPU_VRAM["GPU VRAM (80 GB / node, ~1 TB/s BW)"]
ActiveKV["Active request KV blocks<br/>— lowest latency, smallest capacity"]
end
subgraph CPU_DRAM["CPU DRAM (1-2 TB / node, ~50 GB/s BW)"]
WarmKV["Warm KV blocks<br/>— recent prefix cache, frequent access"]
end
subgraph SSD["SSD (multi-TB / node, ~3 GB/s BW)"]
ColdKV["Cold KV blocks<br/>— infrequent access, long-term persistence"]
end
ActiveKV -- "eviction (idle request KV)" --> WarmKV
WarmKV -- "eviction (LRU/LFU)" --> ColdKV
ColdKV -- "on-demand load" --> WarmKV
WarmKV -- "on-demand load" --> ActiveKV
Figure A1 – KV cache storage hierarchy. Capacity increases 10-100× at each tier while bandwidth decreases proportionally. The Conductor’s scheduling algorithm accounts for read latency from each tier when estimating TTFT.
Eviction Policies
The paper evaluates three eviction policies for the CPU DRAM tier:
- LRU (Least Recently Used): Evict the block that was accessed least recently. Works well when access pattern has temporal locality (repeated conversations, common system prompts).
- LFU (Least Frequently Used): Evict the block with the fewest access counts. Better for workloads where certain prefixes are structurally popular regardless of recency.
- Request-characteristic-based: Use per-request metadata (remaining output budget, SLO tier, request priority) to guide eviction. Prioritize keeping KV blocks for high-value or nearly-complete requests.
In production, Mooncake uses a combination: LRU for general blocks, with a “pinned” category for blocks that are currently in-flight (being used by an active decode instance or being transferred).
Hot-Spot Replication
When the Conductor detects that a particular prefix block set is being requested by many concurrent requests (measured as best_cache_len / len(request) > hot_threshold), it proactively replicates those blocks to multiple prefill nodes:
Procedure: Hot-Spot Cache Replication
IF block_hashes[0..N] are requested by ≥ K concurrent requests within window W:
FOR each prefill node p that doesn't hold these blocks:
IF p has sufficient DRAM capacity:
TransferBlocks(source_node → p, block_hashes[0..N])
p.cached_blocks.add(block_hashes[0..N])
END FOR
// After replication, subsequent requests can be routed to local copies,
// avoiding repeated cross-node fetches and spreading the cache hit load.
This is analogous to CDN edge caching: once a piece of content becomes popular, it gets replicated to edge nodes closer to users. The threshold and window control the trade-off between replication bandwidth cost and cache hit rate improvement.
VRAM Budget Accounting
For the decoding cluster, each active request occupies a contiguous (or paged) slice of GPU VRAM for its KV blocks. The Conductor tracks VRAM utilization per decoding instance to avoid over-committing:
At scheduling time, the Conductor checks that adding this request’s expected KV footprint (up to max_tokens × KV_per_token) does not exceed the decoding instance’s VRAM budget minus a safety margin for the model weights themselves (which are fixed at ~140 GB for LLaMA2-70B with 8-way tensor parallelism across 8 GPUs).
System-Level Goodput Formulation
Mooncake’s evaluation metric is goodput rather than raw throughput: only requests that complete under their SLO constraints count toward goodput. This matters because a system that processes twice as many requests but violates SLOs for half of them is worse than a slower system with 100% SLO compliance.
The denominator is the observation window duration; the numerator counts only fully SLO-compliant completed requests.
Mooncake’s 75% higher goodput on the real Kimi workload means that under the same GPU fleet, Mooncake finishes 1.75× more requests within their latency targets. This directly translates to business value: either 43% fewer GPUs are needed for the same request volume, or the same fleet can serve 75% more users.
Throughput vs. Goodput: Why the Distinction Matters
Consider two systems A and B serving 100 requests/sec:
- System A: 100 req/s throughput, 80% SLO compliance → 80 req/s goodput
- System B: 85 req/s throughput, 100% SLO compliance → 85 req/s goodput
System B has lower throughput but higher goodput. In a production environment where SLO violations trigger customer escalations or refunds, system B is clearly preferable. The vLLM baseline achieves high raw throughput but only 57% TBT compliance, making its effective goodput much lower than the raw numbers suggest.
Mooncake in the Context of the Disaggregated Serving Landscape
The idea of disaggregating prefill and decoding did not originate with Mooncake. The contribution is in how disaggregation is organized and what system is built on top of it. Here is a brief comparison:
| System | Key Contribution | Prefill–Decode | KV Management |
|---|---|---|---|
| Orca (2022) | Iteration-level scheduling | Coupled | Local, per-request |
| vLLM (2023) | PagedAttention | Coupled | Local paged blocks |
| DistServe (2024) | Disaggregated P/D | Disaggregated | Local per-cluster |
| Sarathi-Serve (2024) | Chunked prefill | Coupled (interleaved) | Local |
| Mooncake (2024) | KV-centric scheduler + CPP + cross-cluster KV pool | Disaggregated | Distributed, multi-tier, cluster-wide |
| Llumnix (2024) | Live migration | Disaggregated | Distributed |
The distinguishing Mooncake contribution is elevating the KV cache pool from a local implementation detail to a globally managed, first-class distributed resource. This allows the scheduler to make cache-transfer cost a first-order variable in routing decisions, which none of the prior systems did.
Implementation Details
The paper briefly describes the Mooncake implementation:
- Language: The core serving engine is implemented in C++ with CUDA kernels for attention and MLP.
- KV transfer: The Messenger service uses a custom RDMA library built on top of the ibverbs API (InfiniBand verbs), with GPUDirect RDMA memory registration.
- Control plane: gRPC is used for Conductor-to-instance control messages (request assignment, cache metadata).
- Data plane: NCCL is used within a node for GPU-to-GPU communication (e.g., across tensor parallel ranks); custom RDMA is used across nodes.
- Cache metadata: A distributed hash map (implemented as a custom shard-per-node structure) stores the mapping from
block_hash → (node_id, memory_address, reference_count). - Block size: Typically 16 tokens per block, matching the granularity used by vLLM’s PagedAttention for compatibility.
The total system implementation is described as being in the range of tens of thousands of lines of C++ and Python (for the Conductor and configuration management). The open-source KV transfer engine is available at the GitHub repository kvcache-ai/Mooncake.
Impact on the Serving Ecosystem
Mooncake was published in June 2024 and received the FAST 2025 Best Paper Award, reflecting its significance to the storage and distributed systems community (not just the ML community). Its ideas have influenced subsequent systems:
- Llumnix (OSDI 2024): Takes disaggregated serving further with live migration of running decode requests between instances, enabling fine-grained load balancing and SLO priority enforcement.
- KV-Fold (2024): Recurrent KV compression for very long contexts — a complementary technique that reduces KV size before transfer.
- DistServe follow-on work: Explores optimal prefill:decode GPU ratio as a function of workload characteristics.
The core insight — that KV cache should be the primary scheduling resource, not a secondary consideration — is now widely accepted in the production LLM serving community.
Key Takeaways
Mooncake makes five contributions that are individually incremental but collectively represent a clean architectural design:
-
Disaggregated prefill–decode with a KV-centric lens: Not just physically separate clusters, but an entire philosophy shift where KV cache placement drives all scheduling decisions.
-
Distributed KV cache pool with hash-based prefix sharing: Extends prefix caching from a single-instance optimization to a cluster-wide resource, enabling dramatic cache hit rates for repeated-prefix workloads.
-
Chunked Pipeline Parallelism (CPP): The first application of pipeline parallelism to inference-phase prefill, enabling multi-node scaling for individual ultra-long-context requests.
-
Layer-wise async KV streaming: Overlaps computation with KV transfer to minimize TTFT overhead of disaggregation, and decouples VRAM requirements from context length.
-
TTFT-driven early rejection with load prediction: Prevents wasted prefill compute under overload, protecting both throughput (goodput) and tail latency SLOs simultaneously.
The 75% throughput improvement on real production workloads demonstrates that these architectural choices are not academic exercises but meaningful improvements for real-world LLM serving at scale.
Prefix Caching: The Mathematics of Cache-Hit Rate
The potential savings from prefix caching are substantial and worth quantifying precisely. Define the prefix cache hit rate as the fraction of input tokens whose KV blocks are already in the cache:
The compute savings from caching are:
where is the full prefill FLOP count for a prompt of length :
(The term is for QK attention; the term is for FFN/MLP with hidden ratio 4×.)
For L-Eval’s average input of 19,019 tokens with (80% cache hit) on LLaMA2-70B (, ):
This is a 5× reduction in prefill compute. Combined with reduced queue time at the prefill instances, this directly explains why L-Eval sees +40% throughput while ArXiv (ρ ≈ 0) sees only +20%.
The breakeven point for using a remote cache (paying transfer cost to gain compute savings) occurs when:
Substituting: , , :
Left side: (transfer cost per token)
Right side: (prefill cost per token)
Since , transferring a cached KV block is always faster than recomputing it, as long as RDMA bandwidth is available. This mathematical guarantee is the fundamental reason KV cache disaggregation is profitable in Mooncake’s hardware environment.
Open Research Questions
The paper leaves several intellectually interesting open problems:
-
Optimal prefill:decode GPU ratio. The paper uses fixed [3P+1D] or [10P+10D] configurations but does not derive the optimal ratio analytically. This depends on the workload’s input/output ratio and cache hit rate — an interesting optimization problem.
-
Adaptive threshold tuning. The
kvcache_balancing_thresholdis currently manual. An online bandit or control-theoretic approach could tune it automatically as the workload distribution shifts throughout the day. -
Output length prediction. Accurate per-request output length prediction would transform the early rejection policy from “reject if current load is high” to “reject if predicted cost exceeds capacity.” This would enable much more precise goodput maximization.
-
Hierarchical Conductor. The current Conductor is a centralized service. At very large scale (thousands of nodes), a hierarchical design (rack-level controllers reporting to a global Conductor) would improve scalability and reduce single-point-of-failure risk.
-
KV compression before transfer. If KV blocks are compressed (e.g., using quantization or SVD-based approximation) before RDMA transfer, the effective bandwidth is multiplied. Papers like KV-Fold explore this complementary direction.
Further Reading
- Orca (OSDI 2022): Iteration-level scheduling — the continuous batching foundation Mooncake builds on.
- vLLM / PagedAttention (SOSP 2023): Fine-grained KV paging that Mooncake adopts and extends.
- DistServe (OSDI 2024): Co-authored disaggregated serving work that influenced Mooncake’s prefill–decode separation.
- Sarathi-Serve (OSDI 2024): Chunked-prefill approach — a coupled alternative to Mooncake’s full disaggregation.
- Llumnix (OSDI 2024): Dynamic scheduling with live migration, extending Mooncake’s scheduler ideas.
- SGLang (MLSys 2025): RadixAttention and efficient structured generation — complementary to Mooncake’s KV caching.
- Mooncake open-source repository: The KV transfer engine (Messenger) is open-sourced at
github.com/kvcache-ai/Mooncake, enabling the community to reuse the RDMA-based cross-node KV transfer primitives in other serving systems.
This review is based on Mooncake arXiv:2407.00079 (Qin et al., 2024), winner of the FAST 2025 Best Paper Award. All performance numbers are from the paper’s experimental evaluation on A800 GPU clusters using LLaMA2-70B architecture with dummy weights. The open-source KV transfer engine is available at github.com/kvcache-ai/Mooncake.
Reviewed by Zhongzhu Zhou, 2026-05-28.