Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

Review date: 2026-05-28 Review author: Zhongzhu Zhou Paper reviewed: Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving Paper authors: Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu arXiv: 2407.00079 Status: FAST 2025 Best Paper Award; ACM Transactions on Storage

Short Answer

Mooncake is Kimi’s production LLM serving platform, built around a single insight: the KV cache is the central scarce resource in LLM serving, so the entire system architecture should be organized around managing, reusing, and moving it efficiently. The paper proposes disaggregating the prefill cluster from the decoding cluster, maintains a distributed KV cache pool in CPU DRAM and SSD across the entire fleet, and uses a global “Conductor” scheduler that makes every scheduling decision in terms of KV cache placement and transfer cost. On real Kimi workloads, this architecture handles 75% more requests than a vLLM baseline at the same SLO thresholds, and achieves up to 525% throughput improvement in simulated long-context scenarios.

Prerequisites

This review assumes familiarity with the basics of transformer inference. Here is a compact recap of everything you will need before the paper’s innovations make sense.

1. Transformer Autoregressive Generation

A decoder-only language model like GPT generates text token-by-token. Given an input prompt of LL tokens, the model first runs a prefill (also called initiation) pass: it processes all LL prompt tokens in parallel and caches their computed key-value tensors. Then, in the decoding (also called increment) phase, the model generates one new token per forward pass, each time attending over all previous tokens. Because all previous key and value tensors are cached (the KV cache), each decode step only computes attention for a single new query vector.

2. KV Cache Structure

In each transformer layer, every token produces a key vector kRdkk \in \mathbb{R}^{d_k} and a value vector vRdvv \in \mathbb{R}^{d_v}. For a model with nlayersn_{\text{layers}} layers, nheadsn_{\text{heads}} attention heads of dimension dheadd_{\text{head}}, and half-precision (FP16) storage, the KV cache for a single token takes:

KVtoken=2×nlayers×nheads×dhead×2 bytes\text{KV}_{\text{token}} = 2 \times n_{\text{layers}} \times n_{\text{heads}} \times d_{\text{head}} \times 2 \text{ bytes}

For LLaMA2-70B (nlayers=80n_{\text{layers}} = 80, nheads=64n_{\text{heads}} = 64, dhead=128d_{\text{head}} = 128, GQA with 8 KV heads):

KVtoken=2×80×8×128×2=327,680 bytes320 KB per token\text{KV}_{\text{token}} = 2 \times 80 \times 8 \times 128 \times 2 = 327{,}680 \text{ bytes} \approx 320 \text{ KB per token}

A 128k-token context therefore requires about 40 GB of KV cache — half the VRAM of an A100-80GB.

3. Prefill Compute Complexity

In the prefill pass, computing self-attention for LL prompt tokens has complexity:

  • Attention: O(L2d)O(L^2 \cdot d) — quadratic in sequence length
  • MLP / FFN: O(Ldffn)O(L \cdot d_{\text{ffn}}) — linear in sequence length

This quadratic scaling means that doubling the input length quadruples the attention compute in the prefill pass. For very long contexts (16k–128k tokens), prefill dominates GPU time and blocks the decoding cluster.

4. Service-Level Objectives (SLOs)

Production LLM services define two key latency metrics:

  • TTFT (Time to First Token): Time from request arrival to the first generated token. This is dominated by the prefill phase.
  • TBT (Time Between Tokens): Time between consecutive generated tokens during decoding. This is dominated by memory bandwidth (reading the full KV cache at each decode step).

These two objectives are naturally in tension: batching many requests together improves throughput but increases individual request latency.

5. Continuous Batching (Orca, 2022)

Before Mooncake, the state of the art was Orca’s iteration-level scheduling: instead of running a fixed batch of requests from start to finish, the scheduler picks requests at every generation step, letting finished requests exit and new requests enter dynamically. This eliminated the “batch waits for the slowest request” bottleneck and dramatically improved GPU utilization.

However, Orca couples prefill and decoding in the same cluster — a long prefill for a new request can stall ongoing decoding iterations, degrading TBT latency. Mooncake removes this coupling.

6. Disaggregated Prefill–Decode (DistServe, 2024)

DistServe (reviewed separately) showed that physically separating prefill and decoding into different GPU pools is beneficial: the two phases have very different hardware requirements (compute-intensive vs. memory-bandwidth-intensive), and separating them prevents interference. Mooncake builds directly on this idea, but adds the KV cache as a first-class architectural component.

7. PagedAttention and KV Cache Management

vLLM introduced PagedAttention, which stores KV caches in non-contiguous, fixed-size “pages” (blocks) rather than requiring contiguous GPU memory. This eliminates internal fragmentation and allows fine-grained memory sharing. Mooncake extends this idea to a distributed, hierarchical (GPU VRAM → CPU DRAM → SSD) KV cache pool spanning the entire cluster.

8. RDMA Networking

RDMA (Remote Direct Memory Access) allows one node to read from or write to another node’s memory without involving the remote CPU. RDMA NIC hardware drives the transfer directly, at near-network-wire-speed with very low CPU overhead. Modern GPU clusters use RDMA fabrics (InfiniBand or RoCE) between nodes; within a node, NVLink connects GPUs. Mooncake’s Transfer Engine exploits GPUDirect RDMA: the GPU memory is directly exposed to the RDMA fabric, so KV tensors can move GPU-to-GPU across nodes without a CPU bounce copy.

Why Existing Systems Fall Short

The Coupled Prefill–Decode Problem

gantt
    title Coupled Prefill-Decode (vLLM / Orca style)
    dateFormat  X
    axisFormat  %s
    section Request A (long input)
        Prefill (64k tokens)  :active, a1, 0, 120
        Decode step 1         :a2, 120, 130
        Decode step 2         :a3, 130, 140
    section Request B (arrives at t=50)
        Waiting for prefill to finish :crit, b1, 50, 120
        Prefill (short)       :b2, 120, 130
        Decode step 1         :b3, 130, 140
    section Request C (decode-only)
        Decode step 1 (stalled) :crit, c1, 50, 120
        Decode step 2         :c2, 120, 130
        Decode step 3         :c3, 130, 140

Figure 1 – Coupled interference. When a long-input request starts its prefill (120 time-units), all decode-only requests (Request C) and new arrivals (Request B) are blocked. Request C’s TBT SLO may be violated even though it needs no prefill at all.

In a production system with diverse context lengths (some requests have 128k-token prompts, others have 200-token prompts), this interference is severe. The paper reports that on the real Kimi workload, vLLM achieves only 57% TBT SLO compliance when long-context requests are in flight — 43% of decode intervals miss their latency target.

The Memory Reservation Problem

Orca-style schedulers reserve KV cache GPU memory at request admission time based on max_tokens. If a request is budgeted 2,048 output tokens but finishes in 30, those 2,017 token slots of VRAM sit idle for the duration. This over-reservation leads to low effective VRAM utilization and conservative admission control, further limiting throughput.

The Reuse Opportunity

A critical insight from production data: many requests share common prefixes. System prompts, few-shot examples, and popular document prefixes repeat across thousands of user requests. If the KV cache for a common prefix were stored once and reused, the compute for those tokens would be entirely avoided. vLLM’s RadixAttention / prefix caching implements a local version of this idea, but it only reuses KV cache within a single serving instance. Mooncake extends prefix caching across the entire cluster.

Mooncake Architecture Overview

graph TB
    subgraph Client
        R[Request Stream]
    end

    subgraph Conductor["Conductor (Global Scheduler)"]
        C[Scheduler Logic<br/>TTFT Estimation<br/>SLO Validation<br/>Early Rejection]
    end

    subgraph PrefillPool["Prefill Pool (compute-optimized)"]
        P1[Prefill Instance 1]
        P2[Prefill Instance 2]
        P3[Prefill Instance N]
    end

    subgraph DecodingPool["Decoding Pool (memory-optimized)"]
        D1[Decoding Instance 1]
        D2[Decoding Instance 2]
        D3[Decoding Instance M]
    end

    subgraph KVStore["Distributed KV Cache Pool"]
        GPU[GPU VRAM<br/>active requests]
        DRAM[CPU DRAM<br/>cached blocks]
        SSD[SSD<br/>cold blocks]
    end

    subgraph Messenger["Transfer Engine (Messenger)"]
        RDMA[GPUDirect RDMA<br/>800 Gbps]
    end

    R -->|route| C
    C -->|assign prefill| PrefillPool
    C -->|assign decode| DecodingPool
    PrefillPool <-->|KV read/write| DRAM
    DecodingPool <-->|KV read| DRAM
    DRAM <-->|eviction| SSD
    PrefillPool <-->|transfer| RDMA
    RDMA <-->|transfer| DecodingPool
    PrefillPool <-->|local| GPU
    DecodingPool <-->|local| GPU

Figure 2 – Mooncake system architecture. The Conductor is the global brain; the Prefill Pool and Decoding Pool are physically separate GPU clusters; the distributed KV cache pool spans CPU DRAM and SSD; the Transfer Engine (Messenger) moves KV blocks across machines via RDMA.

The KV Cache as First-Class Citizen

The fundamental shift Mooncake makes is conceptual: instead of treating KV cache as a side-effect of serving requests, it makes KV cache the primary resource around which scheduling, placement, and data movement are organized.

Every scheduled request follows this pipeline:

sequenceDiagram
    participant Client
    participant Conductor
    participant KVPool as KV Cache Pool (CPU DRAM)
    participant Prefill as Prefill Instance
    participant Messenger as Transfer Engine
    participant Decode as Decoding Instance

    Client->>Conductor: New request (prompt_tokens, max_tokens, SLO)
    Conductor->>KVPool: Prefix hash lookup (find cached prefix blocks)
    KVPool-->>Conductor: cache_hit_len (# tokens already cached)
    Conductor->>Conductor: Estimate TTFT = T_queue + T_prefill(L - cache_hit_len) + T_transfer
    Conductor->>Conductor: SLO check (TTFT < TTFT_SLO? TBT feasible?)
    alt SLO feasible
        Conductor->>Prefill: Assign request + cached prefix info
        Prefill->>KVPool: Load prefix KV blocks into GPU VRAM (async)
        Prefill->>Prefill: Compute prefill for remaining tokens (incremental)
        Prefill->>KVPool: Store new KV blocks (hash, layer-by-layer)
        Prefill->>Messenger: Stream KV blocks to decoding node
        Messenger->>Decode: Transfer KV blocks via RDMA
        Decode->>Decode: Begin autoregressive decoding
        Decode->>Client: Stream generated tokens
    else SLO infeasible
        Conductor->>Client: Early reject
    end

Figure 3 – Request lifecycle in Mooncake. The critical path from request arrival to first token includes a prefix cache lookup, incremental prefill (skipping already-cached tokens), KV transfer, and decoding.

KV Cache Storage and Hashing

KV cache blocks are stored as paged units — fixed-size blocks (e.g., 16 tokens worth of KV tensors per block) — across the hierarchy. Each block is tagged with a hash key derived from its token content:

hblock=Hash(parent_hash,tokens[start:end])h_{\text{block}} = \text{Hash}\left(\text{parent\_hash}, \text{tokens}_{[start:end]}\right)

This content-addressed scheme enables deduplication: if two requests share a common prefix, their prefix blocks have identical hashes and can be identified as the same data without comparing the raw tensors. The Conductor uses these hashes for two purposes:

  1. Prefix matching at scheduling time: Find which prefill instance already has the most blocks matching this request’s prefix.
  2. Cache eviction / replication: Track which blocks are hot (many hits) and replicate them; evict cold blocks to SSD.

Transfer Engine: Moving KV Across the Cluster

The Transfer Engine (called “Messenger” in the paper) is the component that makes KV cache disaggregation practical. Without fast KV transfer, the cost of sending computed KV tensors from a prefill node to a decoding node would exceed the savings from prefix caching and disaggregation.

Hardware: GPUDirect RDMA

GPUDirect RDMA allows a GPU’s HBM to be directly exposed to the RDMA fabric. When a prefill node sends a KV block to a decoding node, the RDMA NIC reads directly from GPU HBM without involving the host CPU or PCIe. Key characteristics of the testbed:

  • 800 Gbps RDMA between nodes (InfiniBand HDR or RoCE v2)
  • NVLink within-node (up to 900 GB/s aggregate within an 8-GPU node)
  • 8 × A800-SXM4-80GB GPUs per node (80 GB HBM each)

Transfer time for a KV block of size SS bytes is approximately:

Ttransfer=SBWRDMA+TlatencyT_{\text{transfer}} = \frac{S}{BW_{\text{RDMA}}} + T_{\text{latency}}

For a LLaMA2-70B 16k-token context: S=16,384×320KB5GBS = 16{,}384 \times 320\,\text{KB} \approx 5\,\text{GB}, BW=100GB/sBW = 100\,\text{GB/s} (800 Gbps), so Ttransfer50msT_{\text{transfer}} \approx 50\,\text{ms}. This is acceptable for a TTFT SLO of 30 seconds.

Layer-wise Streaming and Overlap

A naive approach would complete the entire prefill, then transfer all KV blocks to the decoding node, then start decoding. This wastes time: decoding cannot begin until all transfers finish. Mooncake uses layer-wise async streaming:

gantt
    title Layer-wise Prefill KV Transfer (Overlap)
    dateFormat  X
    axisFormat  %s
    section Prefill Compute
        Layer 1-20 compute   :a1, 0, 20
        Layer 21-40 compute  :a2, 20, 40
        Layer 41-60 compute  :a3, 40, 60
        Layer 61-80 compute  :a4, 60, 80
    section KV Transfer (async)
        Transfer layers 1-20 KV  :b1, 20, 35
        Transfer layers 21-40 KV :b2, 40, 55
        Transfer layers 41-60 KV :b3, 60, 75
        Transfer layers 61-80 KV :b4, 80, 95
    section Decoding (starts early)
        Wait for all layers  :crit, c1, 0, 95
        Decode step 1        :c2, 95, 100

Figure 4 – Layer-wise KV transfer overlap. Prefill computes each group of layers, then immediately starts streaming those KV tensors to the decoding node. The total transfer completes shortly after prefill finishes, reducing end-to-end TTFT compared to a monolithic (compute-all then transfer-all) approach.

The paper shows that for 128k-token contexts, layer-wise overlap reduces KV store latency by 30–60% compared to naive bulk transfer.

A second benefit of layer-wise prefill is VRAM independence: because KV tensors are streamed out as they are computed, the prefill GPU does not need to hold all 80 layers’ worth of KV in VRAM simultaneously. It only needs to buffer one layer group at a time. This means prefill can handle arbitrarily long contexts as long as VRAM can hold a single layer group — a major scalability improvement.

The Conductor: Global KV-Cache-centric Scheduler

The Conductor is the central brain of Mooncake. It runs as a separate service and makes all request routing decisions. Its core algorithm is reproduced and expanded below.

Inputs and State

  • P={p1,,pN}P = \{p_1, \ldots, p_N\}: the set of prefill instances
  • D={d1,,dM}D = \{d_1, \ldots, d_M\}: the set of decoding instances
  • RR: an incoming request with fields:
    • prompt_tokens: token IDs of the input prompt
    • max_tokens: upper bound on output length
    • TTFT_SLO, TBT_SLO: latency requirements
  • cache_block_size BB: number of tokens per KV block

Step-by-Step Algorithm

Algorithm: KVCache-Centric Scheduling (Conductor)

Input: P (prefill pool), D (decode pool), R (request), B (block size)
Output: (p*, d*) = chosen (prefill_instance, decoding_instance), or REJECT

1.  block_hashes ← PrefixHash(R.prompt_tokens, B)
    // Generate the sequence of block hash keys for R's prompt.
    // block_hashes[i] is the hash of the i-th KV block.

2.  best_TTFT ← ∞
    best_p ← None
    best_cache_len ← 0

3.  FOR each prefill instance p ∈ P:
4.      // Determine how many prefix blocks this instance already holds
5.      cache_len[p] ← PrefixMatch(p.cached_blocks, block_hashes)
        // cache_len[p] = number of tokens already cached at instance p

6.      T_queue[p] ← EstimatePrefillQueueTime(p)
        // Based on current queue depth and estimated per-token prefill speed

7.      T_prefill[p] ← EstimatePrefillExecTime(len(R.prompt) - cache_len[p])
        // Prefill exec time is linear in # uncached tokens (ignoring quadratic
        // attention for simplicity; actual model accounts for full attention)

8.      IF cache_len[p] < kvcache_balancing_threshold × len(R.prompt):
            // Not enough local cache to justify loading from here;
            // look for a remote source to transfer from
9.          FOR each other instance p' with cache_len[p'] > cache_len[p]:
10.             T_transfer ← EstimateKVTransferTime(p, p', cache_len[p'])
11.             T_candidate ← T_queue[p] + T_prefill_with_remote_cache + T_transfer
12.             IF T_candidate < best_TTFT:
13.                 best_TTFT ← T_candidate
14.                 best_p ← p
15.                 best_cache_source ← p'
16.         END FOR
17.     ELSE:
18.         T_candidate ← T_queue[p] + T_prefill[p]
19.         IF T_candidate < best_TTFT:
20.             best_TTFT ← T_candidate
21.             best_p ← p
22.             best_cache_source ← p   // use local cache
23.     END IF
24. END FOR

25. d* ← SelectDecodingInstance(D)   // load-balanced selection

26. // SLO gate
27. IF best_TTFT > R.TTFT_SLO OR EstimateTBT(d*) > R.TBT_SLO:
28.     RETURN REJECT

29. // Hot-spot mitigation: proactively replicate if cache is heavily skewed
30. IF best_cache_len[best_p] / len(R.prompt) > hot_threshold:
31.     TriggerCacheReplication(best_cache_source → best_p)

32. RETURN (best_p, d*)

Design Choices and Why They Matter

Why balance cache hit rate against load? A naïve cache-aware scheduler would always route to the instance with the most matching prefix blocks, causing hot-spot concentration. Popular system prompts would make a few instances overwhelmed while others sit idle. The kvcache_balancing_threshold parameter controls this tradeoff: if the local cache hit would save less than a threshold fraction of prefill compute, the scheduler considers load-balancing alternatives.

Why estimate TTFT rather than just throughput? Pure throughput maximization ignores tail latency. A scheduler that packs as many tokens as possible into each GPU step will frequently violate TTFT SLOs for new arrivals. By estimating TTFT at scheduling time and comparing it against the SLO, the Conductor can reject requests early (before wasting GPU cycles on their prefill) when the system is overloaded.

Alternative — why not use a token-based priority queue? A simpler design (used in early vLLM) is to admit requests in FIFO order with a maximum batch size. This works well for uniform workloads but fails under long-context diversity: a single 128k-token request blocks thousands of short requests from starting. The Conductor’s per-request TTFT estimation allows heterogeneous workloads to coexist.

Chunked Pipeline Parallelism (CPP) for Multi-Node Prefill

For very long prompts (e.g., 128k tokens), a single GPU node may not have enough VRAM or compute to prefill in an acceptable time. Mooncake introduces Chunked Pipeline Parallelism (CPP), a novel parallelism strategy for the prefill phase.

graph LR
    subgraph Input
        Prompt["128k token prompt<br/>(split into chunks of ≥1000 tokens)"]
    end
    subgraph PipelineGroup["Pipeline Group: 3 prefill nodes"]
        N1["Node 1<br/>Layers 1-26<br/>chunk 1"]
        N2["Node 2<br/>Layers 27-53<br/>chunk 1"]
        N3["Node 3<br/>Layers 54-80<br/>chunk 1"]
    end
    subgraph Time
        T1["t=0: chunk1 → Node1<br/>t=1: chunk1 → Node2, chunk2 → Node1<br/>t=2: chunk1 → Node3, chunk2 → Node2, chunk3 → Node1<br/>..."]
    end
    Prompt --> N1
    N1 -->|"hidden states<br/>(inter-layer comm)"| N2
    N2 -->|"hidden states"| N3
    Prompt --> T1

Figure 5 – Chunked Pipeline Parallelism. The prompt is divided into chunks; each chunk flows through a pipeline of nodes (each node handles a subset of transformer layers). While chunk 1 is at Node 2 (processing layers 27-53), chunk 2 begins at Node 1. This keeps all nodes busy and amortizes inter-node communication over many chunks.

CPP vs. Alternatives

DesignCommunication patternDrawback
Tensor Parallelism (all-reduce)All-reduce at every attention layer across all nodesVery high comm frequency; latency ∝ #nodes
Sequence ParallelismRing-based all-gather + reduce at each layerComplex topology, sensitive to partition changes
CPP (Mooncake)Hidden state transfer only at pipeline boundariesLow freq; can overlap with KV streaming

CPP’s key advantage is that cross-node communication only happens when hidden states cross pipeline boundaries (once per chunk per stage), not at every attention head. This makes communication overlappable with computation: while layer kk runs on Node 2, Node 1 streams its hidden states to Node 2 for the next chunk. The paper notes this is the first application of pipeline parallelism to the LLM inference (prefill) stage, as prior work used pipeline parallelism only for training.

Why CPP instead of Chunked Prefill + Continuous Batching?

Sarathi-Serve (2024) showed that “chunked prefill” (splitting a long prefill into sub-chunks that interleave with decode steps) reduces TTFT–TBT interference. Mooncake’s authors argue CPP is preferable for very long contexts because:

  1. Chunked prefill still runs prefill and decoding on the same GPU cluster, so they still compete for VRAM.
  2. CPP scales prefill capacity horizontally across nodes without touching the decoding cluster.
  3. The pipeline stages allow layer-by-layer KV streaming (a natural fit).

The boundary condition where CPP underperforms is short prompts: if the prompt is only 200 tokens, the overhead of spinning up a multi-node pipeline exceeds any parallelism benefit. CPP is specifically designed for contexts where L>L > a few thousand tokens.

Early Rejection Policy

A subtle but important engineering detail is the early rejection policy. Without it, the following bad scenario occurs repeatedly:

  1. Conductor schedules request RR to a prefill instance (based on estimated decoding load).
  2. Prefill takes 5 seconds for RR‘s 128k-token prompt.
  3. By the time prefill finishes, the decoding instances are overloaded with other requests.
  4. The decoding instance rejects RR because its TBT SLO is already infeasible.
  5. The 5 seconds of prefill compute was wasted entirely.
flowchart TD
    A[New request arrives] --> B{Prefix cache lookup}
    B --> C[Estimate TTFT_candidate]
    C --> D{TTFT_candidate ≤ TTFT_SLO?}
    D -- No --> E[Early reject: TTFT infeasible]
    D -- Yes --> F{Estimate decoding TBT feasibility}
    F -- Infeasible --> G[Early reject: TBT infeasible]
    F -- Feasible --> H[Schedule to prefill instance]
    H --> I[Run prefill]
    I --> J{Decoding instance accepts?}
    J -- No: load changed --> K[Retry or reject]
    J -- Yes --> L[Transfer KV + start decoding]
    L --> M[Return tokens to client]

Figure 6 – Early rejection decision flow. By checking both TTFT and TBT feasibility before scheduling, the Conductor prevents wasted prefill compute and protects decoding SLOs.

Prediction-Based Early Rejection

The basic early rejection checks the current state of the decoding instance. A more sophisticated version — the prediction-based variant — accounts for load fluctuation. The paper observes that prefill and decoding loads oscillate anti-phase: when many prefill tasks complete simultaneously, they flood decoding with new requests, causing a TBT spike.

The prediction algorithm works as follows:

Algorithm: System-Level Decoding Load Prediction

State: Set of in-flight decode requests, each with start_time and expected duration t_d

FOR each future time t in [now, now + T_horizon]:
    // Simulate which requests will finish vs. which new prefills will arrive
    active_at_t = {r ∈ in_flight | r.start_time ≤ t ≤ r.start_time + t_d}
    arriving_at_t = {r ∈ prefill_queue | r.expected_prefill_done ≈ t}
    total_at_t = |active_at_t| + |arriving_at_t|
    
    avg_TBT_ratio[t] = mean over decoding instances d of:
                       (total_at_t / d.capacity) × baseline_TBT

IF max(avg_TBT_ratio) > TBT_SLO:
    REJECT this request
ELSE:
    ACCEPT

The evaluation (Table 2) shows that:

  • Baseline (no early rejection): 4,183 requests rejected by the decoding instance after wasted prefill
  • With early rejection: 3,771 rejected (9.9% reduction in wasted compute)
  • With early rejection + prediction: 3,589 rejected (14.2% reduction)

The prediction gains an additional 4.3% reduction because it catches the “burst arrivals from simultaneous prefill completions” pattern, which pure state-snapshot rejection misses.

Architecture Deep-Dive: Why Disaggregate?

A reasonable question is: why not use the simpler chunked-prefill approach that vLLM and Sarathi-Serve adopt? The paper provides two concrete arguments:

graph LR
    subgraph vLLM["Coupled Architecture (vLLM)"]
        G1[GPU 1<br/>Prefill + Decode]
        G2[GPU 2<br/>Prefill + Decode]
        G3[GPU 3<br/>Prefill + Decode]
        G4[GPU 4<br/>Prefill + Decode]
    end

    subgraph Mooncake["Disaggregated Architecture (Mooncake)"]
        direction LR
        P1[Prefill GPU 1]
        P2[Prefill GPU 2]
        D1[Decode GPU 3]
        D2[Decode GPU 4]
        KV[KV Pool<br/>CPU DRAM]
        P1 <-->|write KV| KV
        P2 <-->|write KV| KV
        D1 <-->|read KV| KV
        D2 <-->|read KV| KV
    end

Figure 7 – Coupled vs. disaggregated architecture. In vLLM, every GPU runs both prefill and decode. In Mooncake, GPU resources are dedicated to one phase, with KV state flowing through a shared pool.

Argument 1: Cross-node parallelism for long prefill. In a coupled system, the maximum prefill parallelism for a single request is bounded by the tensor parallelism degree within one model replica. With disaggregation, Mooncake can group multiple prefill nodes into a CPP pipeline, allowing a single request’s prefill to span many more GPUs without touching the decode cluster.

Argument 2: Layer-wise KV streaming. In a coupled system, KV tensors computed during prefill must remain in GPU VRAM until decoding is done (they cannot be offloaded mid-request without expensive re-computation). In the disaggregated model, prefill nodes stream KV layer-by-layer to CPU DRAM as soon as each layer is computed, freeing GPU VRAM for the next chunk. This makes VRAM occupation during prefill essentially independent of context length.

Where Disaggregation Hurts

The main cost of disaggregation is the KV transfer latency added to TTFT. For short prompts (e.g., 200 tokens), the KV size is tiny (~64 MB for LLaMA2-70B), so transfer takes ~1 ms — negligible. But for very long prompts with no cache hits, transfer is an unavoidable TTFT addition. The system mitigates this by overlapping transfer with the tail of prefill computation (layer-wise streaming).

A second cost is system complexity: managing a distributed KV cache pool, tracking block hashes, implementing RDMA transfers, and predicting transfer times all add operational overhead. The paper acknowledges this, noting that the kvcache_balancing_threshold is “currently adjusted manually” — a limitation that suggests room for a more automated adaptive algorithm.

Experimental Evaluation

Setup

  • Hardware: 8 × NVIDIA A800-SXM4-80GB per node; 800 Gbps RDMA between nodes; NVLink within-node.
  • Model: LLaMA2-70B architecture (dummy weights — actual Kimi model weights are proprietary, so synthetic inference is used for measurement consistency).
  • Baselines: vLLM with continuous batching (representing the coupled prefill–decode state of the art).

Public Dataset Evaluation

Two document-heavy datasets:

DatasetAvg Input TokensAvg Output TokensPrefix Cache Hit RateConfigThroughput vs. vLLM
ArXiv Summarization8,088229~0% (unique docs)Mooncake [3P+1D] vs. vLLM [4M]+20%
L-Eval19,01972>80% (shared context)Mooncake [3P+1D] vs. vLLM [4M]+40%

The [3P+1D] notation means 3 prefill instances + 1 decoding instance in Mooncake; [4M] means 4 monolithic (coupled) instances in vLLM, giving both systems the same total GPU budget.

On ArXiv (no cache reuse), Mooncake still gains 20% from disaggregation alone — decoding is no longer disrupted by prefill. On L-Eval (>80% cache reuse), the cache prefix hit eliminates up to 80% of prefill compute, explaining the larger 40% gain.

Simulated Long-Context Scaling

xychart-beta
    title "Mooncake vs. vLLM throughput improvement (50% cache rate, 512 output tokens)"
    x-axis ["16k", "32k", "64k", "128k"]
    y-axis "Throughput improvement (%)" 0 --> 600
    bar [50, 120, 280, 525]

Figure 8 – Throughput gain scales with context length. For 16k-token inputs Mooncake is 50% faster; at 128k tokens the gain reaches 525%. The reason: vLLM cannot batch long-context requests (one request consumes too much VRAM and blocks others), while Mooncake’s disaggregated KV pool enables batching even at 128k context.

The gain increases super-linearly with context length because:

  1. vLLM’s throughput drops with longer contexts (one large request blocks many small ones).
  2. Mooncake’s throughput holds steady due to separate prefill scaling and cross-cluster KV caching.

Real Kimi Production Workload

The most compelling evaluation uses 23,000 real Kimi user requests replayed at original timestamps. Setup: Mooncake [10P+10D] vs. vLLM [20M] (20 monolithic instances).

MetricMooncakevLLM
TTFT P90 SLO compliance~100%~100%
TBT P90 SLO compliance100%57%
Effective throughput (goodput)75% more requestsbaseline

The TBT result is striking: vLLM violates TBT SLO for 43% of decode intervals. This is because long-context prefill preempts ongoing decode batches, causing irregular token generation timing. Mooncake, with fully separate decoding instances, maintains smooth TBT regardless of concurrent prefill activity.

Limitations and Boundary Conditions

Every design has a regime where it works best and conditions where it struggles:

1. Short-context, low-cache-reuse workloads. When prompts are short (<500 tokens) and there is no prefix sharing, disaggregation adds KV transfer overhead with little benefit (small KV, no cache reuse). A coupled serving architecture would be simpler and comparably efficient for this case.

2. The cold-start cache problem. On first startup or after a cache eviction, there are no cached blocks, and every request must fully prefill. The system provides no optimistic caching during warm-up; performance gradually improves as the cache fills. This is the classic “cold cache” problem in prefetching/caching systems.

3. Manual threshold tuning. The kvcache_balancing_threshold and other scheduling hyperparameters are currently set manually. In a production environment with changing workload distributions, sub-optimal thresholds may lead to either hot-spot concentration (too cache-aware) or excessive cache misses (too load-balanced). The paper acknowledges this and flags adaptive tuning as future work.

4. Output length prediction. The early rejection algorithm assumes a uniform per-request decoding time tdt_d. Real workloads have highly variable output lengths. Accurately predicting output length would enable much more precise TBT SLO compliance estimation, but the paper notes this as “challenging due to high costs or low accuracy” with current prediction models.

5. Fault tolerance. The paper does not discuss what happens when a prefill instance or decoding instance fails mid-request. The KV cache pool (CPU DRAM / SSD) is a potential single point of failure if not replicated.

Worked Example: Conductor Scheduling a Long-Context Request

To make the algorithm concrete, let us trace through a complete scheduling decision for a single request.

Setup: 4 prefill instances (P1-P4), 2 decoding instances (D1-D2), LLaMA2-70B (320 KB KV/token), RDMA BW = 100 GB/s.

Incoming request R:

  • prompt_tokens: 32,768 tokens (32k context)
  • max_tokens: 512
  • TTFT_SLO: 10 seconds
  • TBT_SLO: 0.1 sec/token

Step 1 — Prefix hash lookup: The Conductor generates 2,048 block hashes (32k tokens / 16 tokens per block). It queries each prefill instance’s cached block set.

InstanceCached block matchescache_len (tokens)
P11,200 blocks matched19,200 tokens
P2600 blocks9,600 tokens
P30 blocks0 tokens
P4800 blocks12,800 tokens

Step 2 — TTFT estimation for each instance:

Assumptions:

  • Queue wait = 0.5 sec for all (equal load)
  • Prefill rate = 2,000 tokens/sec (compute-bound, includes quadratic attention)
  • Transfer rate for a block = 16 tokens × 320 KB = 5 MB per block; at 100 GB/s → ~0.05 ms per block

TTTFT(P1)=0.5+32768192002000=0.5+6.78=7.28secT_{\text{TTFT}}(\text{P1}) = 0.5 + \frac{32768 - 19200}{2000} = 0.5 + 6.78 = 7.28\,\text{sec} ✓ (< 10s SLO)

TTTFT(P2)=0.5+3276896002000=0.5+11.58=12.08secT_{\text{TTFT}}(\text{P2}) = 0.5 + \frac{32768 - 9600}{2000} = 0.5 + 11.58 = 12.08\,\text{sec} ✗ (> 10s SLO)

TTTFT(P3)=0.5+327682000=0.5+16.38=16.88secT_{\text{TTFT}}(\text{P3}) = 0.5 + \frac{32768}{2000} = 0.5 + 16.38 = 16.88\,\text{sec} ✗ (> 10s SLO)

TTTFT(P4)=0.5+32768128002000=0.5+9.98=10.48secT_{\text{TTFT}}(\text{P4}) = 0.5 + \frac{32768 - 12800}{2000} = 0.5 + 9.98 = 10.48\,\text{sec} ✗ (> 10s SLO)

Step 3 — Remote cache transfer option for P4 (has second-best cache): If P4 transfers the missing 19,200 - 12,800 = 6,400 tokens worth of blocks from P1:

  • Transfer size = 640016×5MB=2,000MB\frac{6400}{16} \times 5\,\text{MB} = 2,000\,\text{MB}
  • Transfer time = 2GB100GB/s=20ms\frac{2\,\text{GB}}{100\,\text{GB/s}} = 20\,\text{ms}

TTTFT(P4 + transfer from P1)=0.5+20ms+327681920020007.32secT_{\text{TTFT}}(\text{P4 + transfer from P1}) = 0.5 + 20\,\text{ms} + \frac{32768 - 19200}{2000} \approx 7.32\,\text{sec}

Step 4 — Select best:

  • P1: 7.28s ✓ (best with its own cache)
  • P4 + P1 cache: 7.32s ✓ (slightly worse due to transfer)

Conductor selects P1 (local cache, 7.28s TTFT, meets SLO).

Step 5 — Decoding instance: D1 load = 60%, D2 load = 30% → select D2 (lighter load, TBT SLO feasible).

Step 6 — SLO gate passes. Conductor assigns (P1, D2) for this request.

This example illustrates why the Conductor needs both cache-awareness (P1’s large cache avoids 60% of prefill) and load-awareness (overloaded P2 would miss the TTFT SLO despite having 9.6k cached tokens).

KV Cache Eviction, Replication, and the Storage Hierarchy

A distributed KV cache pool introduces classic distributed systems problems: what happens when DRAM fills up? How do you avoid hotspots? How do you decide when to replicate versus evict?

Three-Tier Hierarchy

graph TD
    subgraph GPU_VRAM["GPU VRAM (80 GB / node, ~1 TB/s BW)"]
        ActiveKV["Active request KV blocks<br/>— lowest latency, smallest capacity"]
    end
    subgraph CPU_DRAM["CPU DRAM (1-2 TB / node, ~50 GB/s BW)"]
        WarmKV["Warm KV blocks<br/>— recent prefix cache, frequent access"]
    end
    subgraph SSD["SSD (multi-TB / node, ~3 GB/s BW)"]
        ColdKV["Cold KV blocks<br/>— infrequent access, long-term persistence"]
    end
    ActiveKV -- "eviction (idle request KV)" --> WarmKV
    WarmKV -- "eviction (LRU/LFU)" --> ColdKV
    ColdKV -- "on-demand load" --> WarmKV
    WarmKV -- "on-demand load" --> ActiveKV

Figure A1 – KV cache storage hierarchy. Capacity increases 10-100× at each tier while bandwidth decreases proportionally. The Conductor’s scheduling algorithm accounts for read latency from each tier when estimating TTFT.

Eviction Policies

The paper evaluates three eviction policies for the CPU DRAM tier:

  • LRU (Least Recently Used): Evict the block that was accessed least recently. Works well when access pattern has temporal locality (repeated conversations, common system prompts).
  • LFU (Least Frequently Used): Evict the block with the fewest access counts. Better for workloads where certain prefixes are structurally popular regardless of recency.
  • Request-characteristic-based: Use per-request metadata (remaining output budget, SLO tier, request priority) to guide eviction. Prioritize keeping KV blocks for high-value or nearly-complete requests.

In production, Mooncake uses a combination: LRU for general blocks, with a “pinned” category for blocks that are currently in-flight (being used by an active decode instance or being transferred).

Hot-Spot Replication

When the Conductor detects that a particular prefix block set is being requested by many concurrent requests (measured as best_cache_len / len(request) > hot_threshold), it proactively replicates those blocks to multiple prefill nodes:

Procedure: Hot-Spot Cache Replication

IF block_hashes[0..N] are requested by ≥ K concurrent requests within window W:
    FOR each prefill node p that doesn't hold these blocks:
        IF p has sufficient DRAM capacity:
            TransferBlocks(source_node → p, block_hashes[0..N])
            p.cached_blocks.add(block_hashes[0..N])
    END FOR

// After replication, subsequent requests can be routed to local copies,
// avoiding repeated cross-node fetches and spreading the cache hit load.

This is analogous to CDN edge caching: once a piece of content becomes popular, it gets replicated to edge nodes closer to users. The threshold KK and window WW control the trade-off between replication bandwidth cost and cache hit rate improvement.

VRAM Budget Accounting

For the decoding cluster, each active request occupies a contiguous (or paged) slice of GPU VRAM for its KV blocks. The Conductor tracks VRAM utilization per decoding instance to avoid over-committing:

KVrequest=ncurrent tokens×KVper tokenncurrent×320KB\text{KV}_{\text{request}} = n_{\text{current tokens}} \times \text{KV}_{\text{per token}} \approx n_{\text{current}} \times 320\,\text{KB}

At scheduling time, the Conductor checks that adding this request’s expected KV footprint (up to max_tokens × KV_per_token) does not exceed the decoding instance’s VRAM budget minus a safety margin for the model weights themselves (which are fixed at ~140 GB for LLaMA2-70B with 8-way tensor parallelism across 8 GPUs).

System-Level Goodput Formulation

Mooncake’s evaluation metric is goodput rather than raw throughput: only requests that complete under their SLO constraints count toward goodput. This matters because a system that processes twice as many requests but violates SLOs for half of them is worse than a slower system with 100% SLO compliance.

Goodput={Rcompleted:TTFT(R)TTFT_SLO(R)t:TBTt(R)TBT_SLO(R)}Tobservation\text{Goodput} = \frac{|\{R \in \text{completed} : \text{TTFT}(R) \leq \text{TTFT\_SLO}(R) \land \forall t: \text{TBT}_t(R) \leq \text{TBT\_SLO}(R)\}|}{T_{\text{observation}}}

The denominator is the observation window duration; the numerator counts only fully SLO-compliant completed requests.

Mooncake’s 75% higher goodput on the real Kimi workload means that under the same GPU fleet, Mooncake finishes 1.75× more requests within their latency targets. This directly translates to business value: either 43% fewer GPUs are needed for the same request volume, or the same fleet can serve 75% more users.

Throughput vs. Goodput: Why the Distinction Matters

Consider two systems A and B serving 100 requests/sec:

  • System A: 100 req/s throughput, 80% SLO compliance → 80 req/s goodput
  • System B: 85 req/s throughput, 100% SLO compliance → 85 req/s goodput

System B has lower throughput but higher goodput. In a production environment where SLO violations trigger customer escalations or refunds, system B is clearly preferable. The vLLM baseline achieves high raw throughput but only 57% TBT compliance, making its effective goodput much lower than the raw numbers suggest.

Mooncake in the Context of the Disaggregated Serving Landscape

The idea of disaggregating prefill and decoding did not originate with Mooncake. The contribution is in how disaggregation is organized and what system is built on top of it. Here is a brief comparison:

SystemKey ContributionPrefill–DecodeKV Management
Orca (2022)Iteration-level schedulingCoupledLocal, per-request
vLLM (2023)PagedAttentionCoupledLocal paged blocks
DistServe (2024)Disaggregated P/DDisaggregatedLocal per-cluster
Sarathi-Serve (2024)Chunked prefillCoupled (interleaved)Local
Mooncake (2024)KV-centric scheduler + CPP + cross-cluster KV poolDisaggregatedDistributed, multi-tier, cluster-wide
Llumnix (2024)Live migrationDisaggregatedDistributed

The distinguishing Mooncake contribution is elevating the KV cache pool from a local implementation detail to a globally managed, first-class distributed resource. This allows the scheduler to make cache-transfer cost a first-order variable in routing decisions, which none of the prior systems did.

Implementation Details

The paper briefly describes the Mooncake implementation:

  • Language: The core serving engine is implemented in C++ with CUDA kernels for attention and MLP.
  • KV transfer: The Messenger service uses a custom RDMA library built on top of the ibverbs API (InfiniBand verbs), with GPUDirect RDMA memory registration.
  • Control plane: gRPC is used for Conductor-to-instance control messages (request assignment, cache metadata).
  • Data plane: NCCL is used within a node for GPU-to-GPU communication (e.g., across tensor parallel ranks); custom RDMA is used across nodes.
  • Cache metadata: A distributed hash map (implemented as a custom shard-per-node structure) stores the mapping from block_hash → (node_id, memory_address, reference_count).
  • Block size: Typically 16 tokens per block, matching the granularity used by vLLM’s PagedAttention for compatibility.

The total system implementation is described as being in the range of tens of thousands of lines of C++ and Python (for the Conductor and configuration management). The open-source KV transfer engine is available at the GitHub repository kvcache-ai/Mooncake.

Impact on the Serving Ecosystem

Mooncake was published in June 2024 and received the FAST 2025 Best Paper Award, reflecting its significance to the storage and distributed systems community (not just the ML community). Its ideas have influenced subsequent systems:

  • Llumnix (OSDI 2024): Takes disaggregated serving further with live migration of running decode requests between instances, enabling fine-grained load balancing and SLO priority enforcement.
  • KV-Fold (2024): Recurrent KV compression for very long contexts — a complementary technique that reduces KV size before transfer.
  • DistServe follow-on work: Explores optimal prefill:decode GPU ratio as a function of workload characteristics.

The core insight — that KV cache should be the primary scheduling resource, not a secondary consideration — is now widely accepted in the production LLM serving community.

Key Takeaways

Mooncake makes five contributions that are individually incremental but collectively represent a clean architectural design:

  1. Disaggregated prefill–decode with a KV-centric lens: Not just physically separate clusters, but an entire philosophy shift where KV cache placement drives all scheduling decisions.

  2. Distributed KV cache pool with hash-based prefix sharing: Extends prefix caching from a single-instance optimization to a cluster-wide resource, enabling dramatic cache hit rates for repeated-prefix workloads.

  3. Chunked Pipeline Parallelism (CPP): The first application of pipeline parallelism to inference-phase prefill, enabling multi-node scaling for individual ultra-long-context requests.

  4. Layer-wise async KV streaming: Overlaps computation with KV transfer to minimize TTFT overhead of disaggregation, and decouples VRAM requirements from context length.

  5. TTFT-driven early rejection with load prediction: Prevents wasted prefill compute under overload, protecting both throughput (goodput) and tail latency SLOs simultaneously.

The 75% throughput improvement on real production workloads demonstrates that these architectural choices are not academic exercises but meaningful improvements for real-world LLM serving at scale.

Prefix Caching: The Mathematics of Cache-Hit Rate

The potential savings from prefix caching are substantial and worth quantifying precisely. Define the prefix cache hit rate ρ\rho as the fraction of input tokens whose KV blocks are already in the cache:

ρ=cached_tokenstotal_prompt_tokens\rho = \frac{\text{cached\_tokens}}{\text{total\_prompt\_tokens}}

The compute savings from caching are:

Saved Prefill FLOPs=ρCprefill(L)\text{Saved Prefill FLOPs} = \rho \cdot C_{\text{prefill}}(L)

where Cprefill(L)C_{\text{prefill}}(L) is the full prefill FLOP count for a prompt of length LL:

Cprefill(L)=nlayers(4L2d+8Ld2)C_{\text{prefill}}(L) = n_{\text{layers}} \cdot \left(4 L^2 d + 8 L d^2\right)

(The 4L2d4L^2 d term is for QK attention; the 8Ld28Ld^2 term is for FFN/MLP with hidden ratio 4×.)

For L-Eval’s average input of 19,019 tokens with ρ=0.8\rho = 0.8 (80% cache hit) on LLaMA2-70B (d=8192d = 8192, nlayers=80n_{\text{layers}} = 80):

Effective prefill tokens=19019×(10.8)=3804\text{Effective prefill tokens} = 19019 \times (1 - 0.8) = 3804

This is a 5× reduction in prefill compute. Combined with reduced queue time at the prefill instances, this directly explains why L-Eval sees +40% throughput while ArXiv (ρ ≈ 0) sees only +20%.

The breakeven point for using a remote cache (paying transfer cost to gain compute savings) occurs when:

Ttransfer<Tfull_prefillTpartial_prefillT_{\text{transfer}} < T_{\text{full\_prefill}} - T_{\text{partial\_prefill}}

ρLKV_per_tokenBWRDMA<ρLrprefill\frac{\rho \cdot L \cdot \text{KV\_per\_token}}{BW_{\text{RDMA}}} < \frac{\rho \cdot L}{r_{\text{prefill}}}

KV_per_tokenBWRDMA<1rprefill\frac{\text{KV\_per\_token}}{BW_{\text{RDMA}}} < \frac{1}{r_{\text{prefill}}}

Substituting: KV_per_token=320KB\text{KV\_per\_token} = 320\,\text{KB}, BWRDMA=100GB/sBW_{\text{RDMA}} = 100\,\text{GB/s}, rprefill=2000token/sr_{\text{prefill}} = 2000\,\text{token/s}:

Left side: 320×1031011=3.2μs/token\frac{320 \times 10^3}{10^{11}} = 3.2\,\mu\text{s/token} (transfer cost per token)

Right side: 12000=500μs/token\frac{1}{2000} = 500\,\mu\text{s/token} (prefill cost per token)

Since 3.2μs500μs3.2\,\mu\text{s} \ll 500\,\mu\text{s}, transferring a cached KV block is always faster than recomputing it, as long as RDMA bandwidth is available. This mathematical guarantee is the fundamental reason KV cache disaggregation is profitable in Mooncake’s hardware environment.

Open Research Questions

The paper leaves several intellectually interesting open problems:

  1. Optimal prefill:decode GPU ratio. The paper uses fixed [3P+1D] or [10P+10D] configurations but does not derive the optimal ratio analytically. This depends on the workload’s input/output ratio and cache hit rate — an interesting optimization problem.

  2. Adaptive threshold tuning. The kvcache_balancing_threshold is currently manual. An online bandit or control-theoretic approach could tune it automatically as the workload distribution shifts throughout the day.

  3. Output length prediction. Accurate per-request output length prediction would transform the early rejection policy from “reject if current load is high” to “reject if predicted cost exceeds capacity.” This would enable much more precise goodput maximization.

  4. Hierarchical Conductor. The current Conductor is a centralized service. At very large scale (thousands of nodes), a hierarchical design (rack-level controllers reporting to a global Conductor) would improve scalability and reduce single-point-of-failure risk.

  5. KV compression before transfer. If KV blocks are compressed (e.g., using quantization or SVD-based approximation) before RDMA transfer, the effective bandwidth is multiplied. Papers like KV-Fold explore this complementary direction.

Further Reading

  • Orca (OSDI 2022): Iteration-level scheduling — the continuous batching foundation Mooncake builds on.
  • vLLM / PagedAttention (SOSP 2023): Fine-grained KV paging that Mooncake adopts and extends.
  • DistServe (OSDI 2024): Co-authored disaggregated serving work that influenced Mooncake’s prefill–decode separation.
  • Sarathi-Serve (OSDI 2024): Chunked-prefill approach — a coupled alternative to Mooncake’s full disaggregation.
  • Llumnix (OSDI 2024): Dynamic scheduling with live migration, extending Mooncake’s scheduler ideas.
  • SGLang (MLSys 2025): RadixAttention and efficient structured generation — complementary to Mooncake’s KV caching.
  • Mooncake open-source repository: The KV transfer engine (Messenger) is open-sourced at github.com/kvcache-ai/Mooncake, enabling the community to reuse the RDMA-based cross-node KV transfer primitives in other serving systems.

This review is based on Mooncake arXiv:2407.00079 (Qin et al., 2024), winner of the FAST 2025 Best Paper Award. All performance numbers are from the paper’s experimental evaluation on A800 GPU clusters using LLaMA2-70B architecture with dummy weights. The open-source KV transfer engine is available at github.com/kvcache-ai/Mooncake.

Reviewed by Zhongzhu Zhou, 2026-05-28.