July 4, 2026 EN #KV Cache #Long Context #LLM Inference

MosaicKV: Dynamic Two-Dimensional KV Cache Compression for Long-Context LLM Serving — Technical Review

Review date: 2026-07-04
Review author: Zhongzhu Zhou
Paper reviewed: MosaicKV: Serving Long-Context LLM with Dynamic Two-D KV Cache Compression
Paper authors: Sheng Qiang, Ruiwei Chen, Yinpeng Wu, Jinyu Gu, Zhichao Hua, Yubin Xia, Binyu Zang, Haibo Chen (IPADS, Shanghai Jiao Tong University)
arXiv: 2607.00760
Venue / Status: arXiv preprint, July 2026

Short Answer

MosaicKV is a system for long-context LLM inference that achieves much higher KV cache compression than prior methods by compressing along both the sequence dimension (token selection) and the channel dimension (head-dimension element pruning) simultaneously — a combination that previous systems found too damaging to accuracy to use in practice.

The core insight that makes this work is that KV cache importance is non-uniform: the important elements in one token’s KV vector do not sit in the same channels that matter for another token’s vector. Applying a global channel mask (the approach of earlier methods like ThinK) throws away important information in specific vectors. MosaicKV instead selects elements per-vector, then packs the resulting irregular sparse representations into a dense GPU-friendly format consumed by a custom attention kernel. A heterogeneous CPU/GPU double-buffering scheme moves the expensive SVD and strategy computations off the critical decode path.

The result on an H800 GPU: up to 16× attention speedup, 4.8× lower decode latency, 7.3× higher throughput, 3× lower memory — with only 1.76% average accuracy loss on LongBench and RULER.

Prerequisites

Before diving in, here is the background needed to follow the technical details.

Transformer Attention and the KV Cache

In a Transformer decoder, each attention layer takes a sequence of token embeddings and computes self-attention: for each new token, the model asks “which previous tokens are relevant to me?” by computing dot-product similarity between the query ( $Q$ ) of the current token and the keys ( $K$ ) of all previous tokens:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V \tag{1}

where $d$ is the head dimension (typically 64–128). The resulting weighted sum over values ( $V$ ) is what gets added to the residual stream.

During autoregressive decoding (generating tokens one at a time), we need the keys and values of every previously seen token at every layer. Rather than recomputing them from scratch on each step, inference engines store them in a KV cache in GPU memory. This is fundamentally why the KV cache exists: it trades memory for avoiding redundant computation.
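To make the caching step concrete, here is a minimal NumPy sketch of one decode step for a single head. The function name `decode_step` and the toy shapes are illustrative assumptions, not anything from the paper; a real engine keeps the cache in preallocated GPU memory rather than growing arrays.

```python
import numpy as np

def decode_step(q, k_new, v_new, k_cache, v_cache):
    """One single-head decode step: append the new K/V, then attend over the cache."""
    k_cache = np.vstack([k_cache, k_new[None, :]])   # (T+1, d)
    v_cache = np.vstack([v_cache, v_new[None, :]])   # (T+1, d)
    d = q.shape[0]
    scores = k_cache @ q / np.sqrt(d)                # (T+1,) scaled dot products
    scores -= scores.max()                           # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()                         # softmax over cached tokens
    return weights @ v_cache, k_cache, v_cache

rng = np.random.default_rng(0)
d = 8
k_cache = rng.standard_normal((4, d))                # 4 tokens already cached
v_cache = rng.standard_normal((4, d))
q, k_new, v_new = rng.standard_normal((3, d))
out, k_cache, v_cache = decode_step(q, k_new, v_new, k_cache, v_cache)
```

Every step reads the whole cache once, which is why decode attention ends up bound by memory bandwidth rather than arithmetic.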

KV Cache Memory Scaling

The size of the KV cache is:

\text{Size} = 2 \times L \times T \times H \times d \times \text{bytes\_per\_element} \tag{2}

where $L$ is the number of layers, $T$ is the sequence length, $H$ is the number of attention heads, $d$ is the head dimension, and the factor of 2 accounts for both keys and values.

For LLaMA-3.1-8B (32 layers, 128-dimensional heads, bf16), counting all 32 attention heads as KV heads:

At $T = 128K$ tokens: $2 \times 32 \times 131072 \times 32 \times 128 \times 2 \approx 68.7$ GB (64 GiB)
At $T = 1M$ tokens: $\approx 550$ GB (512 GiB)

This is catastrophic for GPU serving. An H100 has 80 GB HBM; at 128K context the cache plus roughly 16 GB of bf16 weights already exceeds it.

A quick sanity check: with bf16 (2 bytes per element), each 128-dimensional attention head stores $128 \times 2 = 256$ bytes of K and 256 bytes of V per token per layer. Over 32 layers: $256 \times 2 \times 32 = 16384$ bytes = 16 KB per token per head. At 128K tokens this is 2 GiB per head, and across 32 heads it is 64 GiB per request, matching equation (2). LLaMA-3.1-8B actually uses grouped-query attention with 8 KV heads, which cuts the stored cache to 16 GiB per request. That accounting reproduces the paper’s figure of 128 GB for 8 requests at 128K: $8 \times 16$ GiB $= 128$ GiB.
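The arithmetic is easy to get wrong by a factor of two or four, so it is worth evaluating equation (2) directly. The sketch below does that for two accountings: all 32 heads storing their own K and V, and the 8 KV heads that LLaMA-3.1-8B uses under grouped-query attention. The helper name `kv_cache_bytes` is ours.

```python
def kv_cache_bytes(layers, tokens, kv_heads, head_dim, bytes_per_element=2):
    """Equation (2): K and V for every layer, token, and KV head."""
    return 2 * layers * tokens * kv_heads * head_dim * bytes_per_element

GIB = 1024 ** 3
mha_128k = kv_cache_bytes(32, 128 * 1024, 32, 128)   # every head keeps its own K/V
gqa_128k = kv_cache_bytes(32, 128 * 1024, 8, 128)    # 8 shared KV heads (GQA)
print(mha_128k / GIB, gqa_128k / GIB)  # 64.0 16.0
```

Eight concurrent 128K requests under the GQA accounting come to 8 × 16 GiB = 128 GiB.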

SVD and Rotation Matrices

Singular Value Decomposition (SVD) of a matrix $A \in \mathbb{R}^{m \times n}$ is:

A = U \Sigma V^\top \tag{3}

where $U, V$ are orthonormal matrices of left/right singular vectors, and $\Sigma$ is diagonal with singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$ .

For channel compression, we use SVD to rotate the KV cache so that the most “important” information concentrates in the first few channels. Given key matrix $K \in \mathbb{R}^{T \times d}$ , the rotation matrix $R_k$ is the transpose of the right singular matrix $V_k$ of $K$ (the columns of $V_k$ are the eigenvectors of $K^\top K$ ):

K^\top K = V_k \Sigma_k^2 V_k^\top, \quad R_k = V_k^\top \tag{4}

Applying this rotation: $K' = K R_k^\top$ gives a rotated key matrix where the first channels have the highest variance (most information). Retaining only the top- $r$ channels of $K'$ is then a principled low-rank approximation.

The energy preserved by keeping the top- $r$ channels is:

\text{Energy}(r) = \frac{\sum_{i=1}^{r} \sigma_i^2}{\sum_{i=1}^{d} \sigma_i^2} \tag{5b}

For typical attention head KV matrices in large LLMs, the singular value spectrum is highly concentrated: the top 30% of channels often capture >85% of the total variance. This is why SVD-based channel compression works at all — most of the “information” is packed into a small subspace.
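A small NumPy sketch of the rotation and the energy curve follows. The key matrix here is synthetic, with a decaying spectrum that we impose ourselves, so the printed percentage says nothing about real models; the point is only to show that after rotation the per-channel energy equals $\sigma_i^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, r = 512, 16, 4
# Synthetic keys with a decaying spectrum (an assumption, not real model data).
K = rng.standard_normal((T, d)) * np.linspace(4.0, 0.2, d)
K = K @ np.linalg.qr(rng.standard_normal((d, d)))[0]   # mix channels to hide the structure

# Right singular vectors of K are the eigenvectors of K^T K (equation 4).
_, sigma, Vt = np.linalg.svd(K, full_matrices=False)
R_k = Vt                          # R_k = V_k^T, rows are singular directions
K_rot = K @ R_k.T                 # K' = K R_k^T

energy = np.cumsum(sigma**2) / np.sum(sigma**2)        # equation (5b) for r = 1..d
col_energy = (K_rot**2).sum(axis=0)                    # per-channel energy after rotation
print(f"top-{r} channels keep {energy[r - 1]:.1%} of the energy")
```

Because `K_rot` equals $U\Sigma$, channel $i$ of the rotated cache carries exactly $\sigma_i^2$ of the total energy, which is what makes truncating the trailing channels a low-rank approximation.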

With this background in hand, let us understand why long-context serving is so hard and how MosaicKV approaches the problem.

The Long-Context Memory Crisis

The deployment of frontier LLMs with million-token context windows (GPT-5.4, Gemini 3.1 Pro, Claude 4.6) has exposed a fundamental tension: the very feature users want (long context) is exactly what makes serving economically unsustainable.

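The KV cache bottleneck is worse than equation (2) might suggest in isolation, because it compounds with batch serving. The cache scales linearly with batch size, so serving $B = 8$ concurrent requests at 128K tokens takes eight times the per-request footprint: about 128 GiB with LLaMA-3.1-8B’s 8 KV heads, and about 512 GiB if all 32 heads stored their own K and V, before a single GPU computes anything else.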

Beyond memory, the computational share of attention grows with context:

At 128K context: attention accounts for 84.7% of time-between-tokens (TBT)
At 1M context: attention accounts for 97.9% of TBT

This is because attention over $T$ tokens is $O(T)$ per decode step, while the feedforward networks (which dominate at short contexts) are $O(1)$ per token. As context grows, attention increasingly dominates the wall-clock latency.

Figure 1 below summarizes the problem and the solution space.

Figure 1: KV Cache Memory Bottleneck and the 2D Compression Solution Space

graph TD
    A[Long-Context Request<br/>128K - 1M tokens] --> B[KV Cache for 8 Requests<br/>128 GB at 128K<br/>1 TB at 1M]
    B --> C{Compression<br/>Approach?}
    C --> D[Sequence Compression<br/>Token Selection Only]
    C --> E[Channel Compression<br/>Dimension Reduction Only]
    C --> F[Naive 2D<br/>Both Combined]
    C --> G[MosaicKV<br/>Dynamic 2D]
    D --> D1[Retains full KV in memory<br/>Limited compression ratio<br/>Accuracy degrades at high rate]
    E --> E1[Global channel mask<br/>Misses per-vector variation<br/>Cannot do token selection]
    F --> F1[82.8pct accuracy loss<br/>at 70pct channel compression<br/>UNUSABLE]
    G --> G1[Per-vector element selection<br/>Segment-adaptive strategy<br/>16x speedup, 1.76pct loss]
    style G fill:#4CAF50,color:#fff
    style F1 fill:#f44336,color:#fff

The Landscape of KV Cache Compression

Before MosaicKV, two lines of work had independently attacked the problem from different angles.

Sequence Compression (Token Selection)

The observation behind sequence compression is that not all historical tokens matter for a given query. The softmax operation in equation (1) exponentially amplifies score differences, so the attention mass concentrates on a small fraction of “important” tokens. If we can identify these tokens cheaply, we only need to run full attention over them.

Quest (Tang et al., 2024) operates at block granularity: it computes coarse scores per block of tokens by comparing the query against min/max bounds of each block’s keys, then selects the top- $k$ blocks for full attention. This avoids the per-token score computation overhead.
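The min/max idea can be sketched in a few lines of NumPy. This is our reading of the description above, not Quest’s actual kernel: for each block we keep per-channel minima and maxima of the keys, and for each channel take whichever bound gives the larger product with the query. The result is an upper bound on every true score in the block. The function name and block size are illustrative.

```python
import numpy as np

def block_upper_bounds(q, K, block_size):
    """Upper bound on q.k for every key in each block, from per-channel min/max."""
    T, d = K.shape
    n_blocks = T // block_size                        # assumes T divisible, for brevity
    blocks = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    k_min, k_max = blocks.min(axis=1), blocks.max(axis=1)   # (n_blocks, d) each
    # Per channel, the largest possible product is q_j * max_j or q_j * min_j.
    return np.maximum(q * k_min, q * k_max).sum(axis=1)

rng = np.random.default_rng(0)
K = rng.standard_normal((64, 8))
q = rng.standard_normal(8)
bounds = block_upper_bounds(q, K, block_size=16)
top_blocks = np.argsort(bounds)[::-1][:2]             # keep the 2 highest-bound blocks
```

Scoring costs one pass over two $d$-vectors per block instead of one dot product per token, which is where the savings come from.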

H2O (Zhang et al., 2023) uses a static “heavy hitter” policy — tokens that have accumulated high attention scores in the past are likely to be important in the future. It maintains a fixed-size “heavy hitter” KV cache alongside a small recent-token window.

The fundamental limitation: Sequence compression must keep the entire KV cache in GPU memory, because any token might be selected at any decode step. For 8 concurrent 1M-token requests on LLaMA-3.1-8B, that is about 1 TB of KV cache regardless of how few tokens are actually used.

Additionally, the paper’s motivating experiments show that accuracy degrades sharply at high sequence compression rates. Pushing token retention below ~10% incurs significant quality loss on tasks like multi-document QA that genuinely require many tokens.

Channel Compression (Head-Dimension Reduction)

The observation behind channel compression is that KV vector elements are not equally important: large-magnitude “outlier” elements carry most of the information, concentrated in a small subset of channels. If we can identify and retain these channels, we can store much shorter vectors.

ThinK (Xu et al., 2025) computes a uniform channel mask from the prefill-stage key cache. It runs SVD rotation on $K$ , then selects the top- $r$ channels as the globally important ones. All subsequent key vectors use this same mask. For value vectors, ThinK falls back to sequence compression only, because it found applying channel compression to both K and V in 2D mode too damaging.

ShadowKV (Sun et al., 2025) also uses SVD rotation but adds a “shadow key” mechanism for token selection: a low-precision summary of each key block is kept for scoring, while the stored KV cache is channel-compressed.

The fundamental limitation: Global channel masks miss per-vector importance variation. A channel that is globally unimportant may be critical for specific tokens. Channel compression alone also cannot reduce the number of tokens stored.

Why Combining the Two Naively Fails

The paper’s strawman experiment is instructive. They take Quest (sequence compression) and add a global SVD channel compression on top, compressing both K and V vectors to their top- $r$ channels. The accuracy results on LongBench:

Channel Compression Rate	Accuracy Loss vs. Sequence-Only
10%	~5%
30%	24.5%
50%	~55%
70%	82.8%

At the channel compression rates needed for real memory reduction (>50%), the accuracy collapse is catastrophic. This is not a tuning problem — it is a fundamental misalignment between the global channel mask assumption and the actual importance distribution in the KV cache.

The Three Observations that Make MosaicKV Possible

MosaicKV is motivated by three profiling insights that together define the design space.

Observation 1: Non-Uniform Per-Vector Element Importance

After SVD rotation, one might expect the rotated vectors to have their important elements uniformly concentrated in the top channels — that is the whole point of the rotation. But in practice, this is only approximately true.

The paper reports: across the full KV cache, only 62.28% of the top-25% most important elements in individual vectors reside in the globally top-25% channels — even at a small 256-token context. As context grows, this discrepancy worsens.

This means a global channel mask with 25% retention misclassifies ~38% of the truly important elements, labeling them as expendable when they are not. The longer the context, the more diverse the KV vectors, and the worse this misclassification becomes.

Implication: We need per-vector element selection, not a global mask.
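The statistic behind Observation 1 can be measured with a short NumPy sketch. We assume here that a channel’s global importance is its total magnitude across tokens; the paper may rank channels differently. The data is synthetic, so the printed number is not comparable to the 62.28% reported above.

```python
import numpy as np

def global_mask_hit_rate(K_rot, keep_frac=0.25):
    """Fraction of each vector's own top elements that fall in the globally top channels."""
    T, d = K_rot.shape
    r = max(1, int(round(keep_frac * d)))
    mags = np.abs(K_rot)
    global_top = np.argsort(mags.sum(axis=0))[::-1][:r]            # one shared channel set
    per_vec_top = np.argpartition(mags, d - r, axis=1)[:, d - r:]  # top-r per row
    hits = np.isin(per_vec_top, global_top).sum()
    return hits / (T * r)

rng = np.random.default_rng(0)
# Synthetic rotated keys with decaying channel scale (illustrative only).
K_rot = rng.standard_normal((256, 32)) * np.linspace(3.0, 0.3, 32)
rate = global_mask_hit_rate(K_rot)
print(f"{rate:.1%} of per-vector top elements lie in the global top channels")
```

A hit rate of 100% would mean a global mask loses nothing relative to per-vector selection; anything lower is the fraction of important elements a shared mask still captures.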

Observation 2: Distribution Variation Across KV Cache Regions

The KV cache is not homogeneous. Its elements exhibit different statistical properties in different regions:

High-variance regions: A few channels have very large magnitude (outliers), while most are near zero. Here, aggressive channel compression works well because the outliers dominate.
Low-variance regions: Elements are distributed more uniformly. Here, aggressive channel compression discards useful information because no single channel stands out.

A fixed compression strategy applied uniformly to all regions will over-compress some regions and under-compress others. An adaptive per-segment strategy can maintain accuracy while achieving higher average compression.

Implication: We need a segment-adaptive compression strategy, not a global one.

Observation 3: Extreme GPU Resource Imbalance During Decode

Profiling attention computation during the decode stage of a 256K-token request on LLaMA-3.1-8B (A800 GPU, using FlashInfer):

Memory bandwidth utilization: 90.5% — essentially saturated
CUDA core utilization: 10.35% — nearly idle
CPU utilization: ~0% — completely idle

This imbalance follows directly from decode attention being memory-bandwidth-bound: the limiting step is reading KV cache entries from HBM, not computing with them, so CUDA cores and the CPU sit idle while the memory bus saturates.

Implication: We can exploit underutilized CUDA cores for custom sparse attention, and use the CPU for asynchronous compression management — at essentially zero opportunity cost.

Dynamic Two-D Compression: The Core Algorithm

MosaicKV’s primary contribution is a two-stage compression algorithm that applies both sequence and channel compression while exploiting per-vector and per-segment variation.

Figure 2: Dynamic Two-D Compression Pipeline

flowchart LR
    A[Raw KV Cache<br/>K: T-by-d matrix<br/>V: T-by-d matrix] --> B[Step 1: SVD Rotation<br/>R_k = eigvecs of K-transpose-K<br/>K-rotated = K times R_k-transpose]
    B --> C[Step 2: Partition<br/>into Segments<br/>S_1 ... S_m]
    C --> D[Step 3: Per-Segment<br/>Strategy on CPU<br/>compute optimal r_j<br/>for each segment S_j]
    D --> E[Step 4: Per-Vector<br/>Element Selection<br/>For each vector k-prime-i<br/>find its own top-r elements]
    E --> F[Sparse KV Repr<br/>k-hat-i = values + indices<br/>v-hat-i = values + indices]
    F --> G[Token Selection<br/>Block-level scores<br/>over compressed K]
    G --> H[Packed Sparse<br/>Attention<br/>2D Compressed Output]

Step 1: SVD Rotation for Channel Concentration

Given key matrix $K \in \mathbb{R}^{T \times d}$ collected during prefill, we compute the rotation:

K^\top K = V_k \Sigma_k^2 V_k^\top \tag{5}

R_k = V_k^\top, \quad K' = K R_k^\top \tag{6}

After rotation, $K'$ has its high-variance directions in the first channels. The same procedure yields $R_v$ and $V' = V R_v^\top$ for value vectors.

Note that this rotation must be applied to every new key vector as it arrives during decode: $k'_\text{new} = k_\text{new} R_k^\top$ . This is an $O(d^2)$ matrix-vector product per new token per head per layer, which is cheap relative to the attention computation itself.

Step 2: Per-Vector Element Selection

Instead of selecting the same $r$ channels for all vectors (ThinK’s approach), MosaicKV selects the top- $r$ elements individually for each vector.

For token $i$ with rotated key vector $k'_i \in \mathbb{R}^d$ :

\mathcal{I}_i = \text{argtopk}(|k'_i|, r), \quad \hat{k}_i = (k'_i[\mathcal{I}_i],\ \mathcal{I}_i) \tag{7}

where $\text{argtopk}(|k'_i|, r)$ returns the indices of the $r$ largest-magnitude elements. The sparse representation $\hat{k}_i$ stores $r$ (value, index) pairs.

The same procedure applies to value vectors: $\hat{v}_i = (v'_i[\mathcal{J}_i], \mathcal{J}_i)$ for some retention count $r_v$ (which may differ from $r$ ).

Pseudocode for Per-Vector Element Selection:

Algorithm 1: Per-Vector Element Selection
Input:  K' ∈ R^{T×d}  (SVD-rotated key cache)
        r (number of elements to retain per vector)
Output: K̂ = list of (values, indices) per vector

for i = 1 to T:
    magnitudes = |K'[i, :]|              # d-dimensional magnitude vector
    I_i = argtopk(magnitudes, r)         # top-r indices by magnitude
    K̂[i].values = K'[i, I_i]            # selected values
    K̂[i].indices = I_i                  # their positions

return K̂

The key difference from ThinK: the index set $\mathcal{I}_i$ is different for each token, while ThinK uses a single shared index set $\mathcal{I}^*$ for all tokens.
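Algorithm 1 translates almost line for line into vectorized NumPy. The sketch below also builds a shared-channel mask for comparison, ranking channels by total magnitude (our assumption for what "global" means here). For any fixed $r$, per-vector selection can never reconstruct a row worse than a shared mask does, because it keeps each row’s $r$ largest magnitudes.

```python
import numpy as np

def per_vector_topr(K_rot, r):
    """Algorithm 1: keep each row's r largest-magnitude elements as (values, indices)."""
    d = K_rot.shape[1]
    idx = np.argpartition(np.abs(K_rot), d - r, axis=1)[:, d - r:]
    idx.sort(axis=1)                                   # ascending indices, for readability
    vals = np.take_along_axis(K_rot, idx, axis=1)
    return vals, idx

def densify(vals, idx, d):
    """Scatter (values, indices) back to dense rows with zeros elsewhere."""
    out = np.zeros((vals.shape[0], d), dtype=vals.dtype)
    np.put_along_axis(out, idx, vals, axis=1)
    return out

rng = np.random.default_rng(0)
K_rot = rng.standard_normal((6, 8))
vals, idx = per_vector_topr(K_rot, r=3)
K_hat = densify(vals, idx, d=8)

# Global mask for comparison: the same 3 channels for every row.
shared = np.argsort(np.abs(K_rot).sum(axis=0))[::-1][:3]
K_global = np.zeros_like(K_rot)
K_global[:, shared] = K_rot[:, shared]

err_vec = np.linalg.norm(K_rot - K_hat)      # residual with per-vector selection
err_glob = np.linalg.norm(K_rot - K_global)  # residual with a shared mask
```

The price is the index array: each retained value now needs its channel position stored next to it.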

Step 3: Dynamic Segment-Adaptive Compression Strategy

The KV cache is partitioned into segments $S_1, S_2, \ldots, S_m$ (each containing $s$ consecutive tokens). For each segment, the CPU computes a compression strategy, namely the per-vector retention count $r_j$ , based on the segment’s distribution characteristics.

r_j = f(\text{stats}(S_j)) \tag{8}

where $f$ maps segment statistics (e.g., variance, outlier ratio, entropy of magnitude distribution) to an optimal retention count. High-variance segments (with pronounced outliers) can be more aggressively compressed because outlier channels dominate; low-variance segments require higher retention to avoid losing information.

The total budget constraint is:

\sum_{j=1}^{m} r_j \cdot |S_j| \leq R_\text{budget} \tag{9}

where $R_\text{budget}$ is the target total element budget. This is a constrained allocation problem solved on the CPU per-segment.

Pseudocode for Segment Strategy Generation:

Algorithm 2: Dynamic Segment Strategy Generation
Input:  KV cache segment S_j ∈ R^{s×d}  (rotated)
        B_j (element budget allotted to this segment, a share of R_budget)
Output: r_j (per-vector retention count for this segment)

# Compute segment statistics
variance = mean(var(S_j, axis=0))         # channel-wise variance (diagnostic)
outlier_ratio = count(|S_j| > threshold) / (s * d)
p = normalized histogram of |S_j|
entropy = -sum(p * log(p))                # magnitude histogram entropy (diagnostic)

# High variance / outlier-heavy: can compress more aggressively
if outlier_ratio > outlier_thresh:
    r_j = B_j / s * (1 - alpha * outlier_ratio)
else:
    # This branch exceeds B_j / s, so constraint (9) must be
    # re-checked across all segments afterwards
    r_j = B_j / s * (1 + alpha * (1 - outlier_ratio))

r_j = clamp(round(r_j), r_min, r_max)     # integer count within allowed range
return r_j

This adaptation is what gives MosaicKV its “Mosaic” name: different KV cache regions get different compression patterns, like tiles of a mosaic.
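A runnable version of this idea might look like the sketch below. It follows the branch rule of Algorithm 2 and then trims the largest allocations until constraint (9) holds. The outlier test (magnitude above `z` times the segment median), `alpha`, and the thresholds are all hypothetical choices of ours; the paper’s actual mapping $f$ is not specified in this review.

```python
import numpy as np

def segment_retention(segments, avg_r, alpha=0.5, outlier_thresh=0.05,
                      z=3.0, r_min=2, r_max=None):
    """Assign a per-vector retention count r_j to each rotated segment."""
    d = segments[0].shape[1]
    r_max = d if r_max is None else r_max
    sizes = np.array([len(s) for s in segments])
    budget = avg_r * sizes.sum()                       # R_budget in equation (9)
    r = np.empty(len(segments), dtype=int)
    for j, s in enumerate(segments):
        mags = np.abs(s)
        outlier_ratio = np.mean(mags > z * np.median(mags))   # hypothetical outlier test
        if outlier_ratio > outlier_thresh:             # outlier-heavy: compress harder
            raw = avg_r * (1 - alpha * outlier_ratio)
        else:                                          # flat: retain more
            raw = avg_r * (1 + alpha * (1 - outlier_ratio))
        r[j] = int(np.clip(raw, r_min, r_max))
    # Trim the largest allocations until sum(r_j * |S_j|) fits the budget.
    while (r * sizes).sum() > budget and r.max() > r_min:
        r[np.argmax(r)] -= 1
    return r

rng = np.random.default_rng(0)
spiky = rng.standard_normal((32, 16)) * 0.1
spiky[:, :2] *= 50                                     # two dominant outlier channels
flat = rng.uniform(0.5, 1.5, (32, 16)) * rng.choice([-1.0, 1.0], (32, 16))
r = segment_retention([spiky, flat], avg_r=6)
```

On this toy input the outlier-heavy segment ends up with the smaller retention count and the flat segment with the larger one, while the total stays within the element budget.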

Token Selection over Two-D Compressed KV

For block-level token selection (analogous to Quest), MosaicKV computes approximate attention scores using only the compressed key vectors. For query $q' = q R_k^\top$ and a block of tokens $B_\ell$ :

s_\ell = \frac{1}{\sqrt{d}} \max_{i \in B_\ell} q'[\mathcal{I}_i]^\top k'_i[\mathcal{I}_i] \tag{10}

The top- $k$ blocks by score are selected for full attention computation. Both K and V for the selected tokens are available in their sparse form.
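Equation (10) amounts to a gather on the query followed by a row-wise dot product and a per-block max. A minimal NumPy sketch, with helper names of our own choosing and a block size picked for illustration:

```python
import numpy as np

def top_r(K_rot, r):
    """Per-vector top-r selection, returning packed (values, indices)."""
    d = K_rot.shape[1]
    idx = np.argpartition(np.abs(K_rot), d - r, axis=1)[:, d - r:]
    return np.take_along_axis(K_rot, idx, axis=1), idx

def sparse_block_scores(q_rot, vals, idx, block_size):
    """Equation (10): per-block max of approximate scores from sparse keys."""
    d = q_rot.shape[0]
    # Gather the query entries that line up with each key's retained indices.
    token_scores = (q_rot[idx] * vals).sum(axis=1) / np.sqrt(d)    # (T,)
    n_blocks = len(token_scores) // block_size         # assumes T divisible by block_size
    return token_scores[: n_blocks * block_size].reshape(n_blocks, block_size).max(axis=1)

rng = np.random.default_rng(0)
K_rot = rng.standard_normal((16, 8))
q_rot = rng.standard_normal(8)
vals, idx = top_r(K_rot, r=3)
scores = sparse_block_scores(q_rot, vals, idx, block_size=4)
selected_blocks = np.argsort(scores)[::-1][:2]         # top-2 blocks go to full attention
```

With $r = d$ the gather covers every channel and the scores reduce to the exact scaled dot products, which is a convenient correctness check.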

The full two-D compressed attention over selected tokens is:

o = \text{softmax}\!\left(\frac{q' \hat{K}^\top}{\sqrt{d}}\right) \hat{V} \tag{11}

where $\hat{K}$ and $\hat{V}$ are the packed sparse representations of the selected tokens.

Finally, the output must be rotated back:

\text{output} = o \cdot R_v \tag{12}
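It is worth writing out why the rotations are harmless before any elements are dropped. Both identities below use only the orthogonality of the rotation matrices, $R_k^\top R_k = I$ and $R_v^\top R_v = I$, with $p_i$ denoting the softmax weights:

```latex
\begin{aligned}
q'\,{k'_i}^{\top} &= (q R_k^\top)(k_i R_k^\top)^\top = q\, R_k^\top R_k\, k_i^\top = q\, k_i^\top, \\
o\, R_v &= \Big(\sum_i p_i\, v_i R_v^\top\Big) R_v = \sum_i p_i\, v_i\, R_v^\top R_v = \sum_i p_i\, v_i .
\end{aligned}
```

So the scores are unchanged by rotating queries and keys together, and rotating the output back recovers the original attention result exactly. Once elements are pruned, both equalities become approximations, and all of the accuracy loss comes from the pruning rather than the rotation.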

Compressed KV Cache Management: Exploiting Idle Resources

The dynamic per-vector approach introduces two new engineering challenges:

Sparse access overhead: Index patterns differ from vector to vector, which creates irregular memory access, destroys GPU cache locality, and forces many small uncoalesced reads from HBM.
Management overhead: SVD computation, strategy generation, and recompression are expensive and cannot run on the decode critical path.

MosaicKV solves both with its compressed KV cache management system.

Packed Sparse Attention

The irregular index patterns of per-vector element selection ( $\hat{k}_i$ has different index set $\mathcal{I}_i$ ) make naive sparse attention expensive. A direct implementation would:

For each token $i$ , read $\hat{k}_i$ .values from scattered memory locations
This causes repeated HBM cache misses → throughput collapse

MosaicKV’s fix: Pack all sparse vectors into a dense auxiliary format.

For $n$ selected tokens, each with $r$ retained elements:

Pack values into $P_K \in \mathbb{R}^{n \times r}$ (contiguous in memory)
Pack index arrays into $I_K \in \mathbb{Z}^{n \times r}$

The attention computation uses $P_K$ directly (dense matrix access, no scatter/gather), but requires a modified attention kernel that accounts for the fact that element $P_K[i, j]$ corresponds to dimension $I_K[i, j]$ in the original space.

This custom kernel is implemented using CUDA cores rather than tensor cores. Tensor cores are optimized for structured matrix multiplications with specific size/alignment requirements. The irregular (but packed) format of $P_K$ is better handled by CUDA cores, which provide more flexible thread-level computation. The 10.35% CUDA core utilization observed during baseline attention leaves ample capacity for this.
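The arithmetic of the packed kernel can be sketched on the CPU with NumPy. This reference illustrates the data layout only, not the paper’s CUDA implementation: scores gather query entries through $I_K$, and the output scatter-adds weighted values through a matching index matrix for V (called `I_V` here, a name we introduce).

```python
import numpy as np

def pack(X, r):
    """Per-row top-r elements as dense (values, indices) matrices of shape (n, r)."""
    d = X.shape[1]
    idx = np.argpartition(np.abs(X), d - r, axis=1)[:, d - r:]
    return np.take_along_axis(X, idx, axis=1), idx

def packed_sparse_attention(q_rot, P_K, I_K, P_V, I_V, d):
    """Attention over n tokens stored as packed (values, indices) pairs."""
    # Scores: element P_K[i, j] multiplies the query entry at position I_K[i, j].
    scores = (q_rot[I_K] * P_K).sum(axis=1) / np.sqrt(d)           # (n,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    # Output: scatter-add each token's weighted sparse values into d channels.
    out = np.zeros(d)
    np.add.at(out, I_V, w[:, None] * P_V)
    return out

rng = np.random.default_rng(0)
n, d, r = 5, 8, 3
K_rot, V_rot = rng.standard_normal((2, n, d))
q_rot = rng.standard_normal(d)
P_K, I_K = pack(K_rot, r)
P_V, I_V = pack(V_rot, r)
out = packed_sparse_attention(q_rot, P_K, I_K, P_V, I_V, d)
```

The query is only $d$ elements, so gathering from it is cheap; the large arrays `P_K` and `P_V` are read sequentially, which is the property the packed layout is designed to preserve.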

Figure 3: Packed Sparse Attention vs. Unpacked Sparse Access

graph LR
    subgraph NaiveSparse["Naive Sparse Slow"]
        A1[k-hat-1: val 3.1,-2.4 idx 5,12] --> C1[Scattered HBM reads<br/>Cache miss per element]
        A2[k-hat-2: val 1.8,0.9 idx 3,7] --> C1
        A3[k-hat-3: val 2.2,-1.1 idx 9,2] --> C1
    end
    subgraph MosaicPacked["MosaicKV Packed Dense Fast"]
        B1[P_K: dense n-by-r matrix<br/>contiguous memory] --> C2[CUDA core attention<br/>Dense access no cache miss]
        B2[I_K: index matrix n-by-r] --> C2
    end
    C1 --> D1[High cache miss rate<br/>Memory BW not saved]
    C2 --> D2[Low cache miss rate<br/>Full BW savings realized]

Heterogeneous Double Compression Buffering

The second challenge is management overhead. SVD computation on a segment $S_j$ takes $O(s \cdot d^2)$ time — tens of milliseconds for typical parameters. This cannot be done inline during decode without stalling generation.

MosaicKV uses a double-buffering scheme with two asynchronous pipelines:

GPU-side buffer (fast, approximate): When new KV vectors arrive during decode (from newly generated tokens), they are immediately compressed using a simple fast scheme (e.g., global rotation + fixed top- $r$ channels) and stored in a GPU buffer. These approximately-compressed vectors are used for subsequent decode steps without delay.

CPU-side buffer (slow, optimal): In parallel, the CPU receives a copy of the raw new KV vectors. It performs the full computation:

Accumulate new vectors into the current segment
When segment is full: compute per-segment SVD rotation
Apply per-segment strategy selection (Algorithm 2)
Apply per-vector element selection (Algorithm 1)
Pack into dense format $(P_K, I_K)$

Switching mechanism: When the CPU finishes processing a segment, it triggers an atomic swap: the GPU buffer’s approximate representation for that segment is replaced with the CPU’s optimal representation. This swap is carefully synchronized to not block any in-flight decode operation.
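The swap logic can be mimicked in plain Python with a lock and a background thread. This is a stand-in under loud assumptions: a real system would coordinate with CUDA streams and events rather than `threading`, and the class and function names here are hypothetical. The fast path keeps a fixed prefix of channels; the slow path keeps each vector’s top-2 elements by magnitude.

```python
import threading

class SegmentBuffer:
    """Holds one representation per segment; readers never see a half-written swap."""
    def __init__(self):
        self._lock = threading.Lock()
        self._segments = {}                 # segment id -> (quality, payload)

    def put_fast(self, seg_id, payload):
        with self._lock:                    # never overwrite an optimal entry
            self._segments.setdefault(seg_id, ("fast", payload))

    def swap_in_optimal(self, seg_id, payload):
        with self._lock:                    # single reference replacement under the lock
            self._segments[seg_id] = ("optimal", payload)

    def read(self, seg_id):
        with self._lock:
            return self._segments[seg_id]

def cpu_worker(buffer, raw_segments, compress):
    """Slow path: recompress each raw segment, then swap it in."""
    for seg_id, raw in raw_segments.items():
        buffer.swap_in_optimal(seg_id, compress(raw))

buf = SegmentBuffer()
raw = {0: [3.0, -0.1, 2.0, 0.2], 1: [0.5, -4.0, 0.1, 1.5]}
for seg_id, vec in raw.items():
    buf.put_fast(seg_id, vec[:2])                        # fast path: fixed first channels
top2 = lambda v: sorted(v, key=abs, reverse=True)[:2]    # slow path: per-vector top-2
worker = threading.Thread(target=cpu_worker, args=(buf, raw, top2))
worker.start()
_ = buf.read(0)                                          # decode keeps reading meanwhile
worker.join()
```

A reader always gets either the fast or the optimal representation of a segment, never a mixture, which is the property the atomic swap is meant to guarantee.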

Figure 4: Heterogeneous Double Compression Buffering

sequenceDiagram
    participant GPU as GPU (Decode)
    participant GPUB as GPU Buffer
    participant CPUB as CPU Buffer
    participant CPU as CPU (Async)

    GPU->>GPUB: New KV vectors arrive
    GPUB->>GPUB: Fast compression<br/>(global rotation + fixed top-r)
    GPU->>GPUB: Read compressed KV<br/>for current decode step
    Note over GPU,GPUB: Decode continues<br/>without waiting

    GPUB->>CPUB: Copy raw KV vectors to CPU
    CPU->>CPUB: Accumulate until segment full
    CPU->>CPU: Compute per-segment SVD
    CPU->>CPU: Run Strategy Selection (Alg 2)
    CPU->>CPU: Run Per-Vector Selection (Alg 1)
    CPU->>CPU: Pack into dense format

    CPU->>GPUB: Atomic swap: replace fast-compressed<br/>with optimally-compressed segment
    Note over GPUB: Future decode steps use<br/>optimal compression

This design achieves both low management latency (decode never stalls) and high compression quality (CPU has time to compute the optimal strategy).

System Architecture Overview

Figure 5: MosaicKV Full System Architecture

graph TD
    A[Long-Context Request<br/>100K to 1M tokens] --> B[Standard Transformer Prefill<br/>Compute Q K V for all input tokens]
    B --> C[Compute SVD Rotation Matrices<br/>R_k from K-transpose-K eigenvectors<br/>R_v from V-transpose-V eigenvectors]
    C --> D[Rotate KV Cache<br/>K-rotated = K times R_k-transpose<br/>V-rotated = V times R_v-transpose]
    D --> E[Initial Segment Compression<br/>CPU: per-segment strategy<br/>GPU: per-vector selection + pack]
    E --> F[New token KV vector arrives during decode]
    F --> G[Fast GPU Compression<br/>global rotation + fixed top-r<br/>into GPU buffer]
    G --> H[Token Selection<br/>Block scores over compressed K<br/>Select top-k blocks]
    H --> I[Packed Sparse Attention<br/>CUDA cores on packed P_K and P_V<br/>Produce attention output]
    I --> J[Rotate output back<br/>output times R_v<br/>Continue decode loop]
    J --> F
    G --> K[CPU receives raw KV copies async]
    K --> L[Accumulate to full segment]
    L --> M[Per-segment SVD + strategy generation]
    M --> N[Per-vector element selection<br/>Pack into dense format]
    N --> O[Atomic swap into GPU buffer<br/>replace fast with optimal compression]
    O --> G

Evaluation and Results

Experimental Setup

Hardware: H800 GPU (80 GB HBM)
Models: LLaMA-3.1-8B (and additional models in supplementary results)
Baselines: Uncompressed baseline, Quest (sequence-only), ThinK (channel-only or partial 2D), ShadowKV (SVD-enhanced channel + limited 2D)
Benchmarks: LongBench (multi-task long-context understanding), RULER (synthetic recall tasks)
Context range: 64K to 1M tokens

Memory Reduction

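At 128K context length, MosaicKV reduces KV cache memory by 3× vs. the uncompressed baseline. Both K and V are compressed to ~30% of channels, and only ~10% of tokens are selected for attention at each decode step. For the data that attention actually reads, the two effects multiply:

\text{Compression Ratio} = \frac{1}{(T_\text{retained} / T) \times (r / d)} = \frac{1}{0.10 \times 0.30} \approx 33\times \tag{13}

This 33× figure describes the per-step attention working set, not resident memory. Token selection does not evict tokens: every token stays in the cache in channel-compressed form so that it can be selected later, and each retained element also carries an index. Resident memory is therefore governed mainly by the channel retention rate ( $1/0.30 \approx 3.3\times$ before index overhead), which is in line with the headline 3×.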

This memory reduction directly enables larger batch sizes, which translates to the throughput gains.

Attention Speedup

MosaicKV achieves up to 16× attention speedup over the uncompressed baseline. This speedup comes from two sources:

Fewer elements to process: Attention over $s_\text{selected}$ tokens with $r$ channels instead of $T$ tokens with $d$ channels:

\text{Theoretical Speedup} = \frac{T \times d}{s_\text{selected} \times r} \tag{14}

At 10% token selection, 30% channel retention: $\frac{1}{0.10 \times 0.30} \approx 33\times$ theoretical speedup. The realized 16× is lower because of overhead in the packed sparse attention kernel, which also reads the index arrays and runs on CUDA cores rather than tensor cores.

Better cache utilization: The packed dense format for attention avoids scattered HBM accesses that would dominate latency in an unoptimized sparse implementation.

Decode Latency and Throughput

Decode latency: 4.8× lower than uncompressed baseline at 256K context
Throughput: 7.3× higher (measured as tokens/second across batch)

The throughput gain exceeds the latency improvement because lower memory usage enables larger batches — more requests can be served simultaneously.

Figure 6: Performance Summary vs. Baselines

graph LR
    A["Uncompressed<br/>1.0x throughput"] --> B["Quest seq-only<br/>~2.1x throughput"]
    B --> C["ThinK chan-only<br/>~1.8x throughput"]
    C --> D["MosaicKV full-2D<br/>7.3x throughput"]
    style A fill:#cccccc
    style B fill:#88aaff
    style C fill:#88aaff
    style D fill:#4CAF50,color:#fff

System	Relative Throughput	Accuracy Loss	Memory Reduction
Uncompressed	1.0×	0%	1×
Quest (seq only)	~2.1×	~5-15%	~1× (full KV in mem)
ThinK (channel)	~1.8×	~8%	~1.5×
MosaicKV	7.3×	1.76%	3×

(Quest and ThinK numbers are approximate relative figures; MosaicKV’s 7.3× and 1.76% are directly reported by the paper vs. uncompressed baseline.)

Accuracy on LongBench and RULER

Average accuracy loss: 1.76% across both LongBench and RULER benchmarks.

This is remarkable given the compression ratio. For comparison:

Quest alone at equivalent token retention: ~8-15% accuracy loss (at very high compression)
ThinK at 30% channel compression: ~8% accuracy loss
Naive 2D combination: 24.5–82.8% accuracy loss (depending on channel compression rate)

The accuracy preservation comes directly from the per-vector element selection: each vector retains its own most important elements, preventing the systematic information loss of global masking.

Ablation Results

The paper includes ablations demonstrating:

Per-vector selection vs. global mask: Per-vector selection recovers ~60% of the accuracy lost by global masking at equivalent compression ratios
Segment-adaptive strategy vs. uniform: Adaptive strategy provides 2-5% additional accuracy at same memory budget
CPU buffering: Without buffering, management overhead would add ~30% to decode latency

Worked Numerical Example: Compression Ratio at 256K Context

To make the compression ratio concrete, consider LLaMA-3.1-8B serving a single request at $T = 256K$ tokens, with MosaicKV configured to:

Token selection rate: 10% (select 25,600 out of 256,000 tokens)
Channel retention rate: 30% (keep $r = 38$ out of $d = 128$ channels per head)

Original KV cache per attention layer (one head):

\text{Size}_\text{original} = 2 \times 256000 \times 128 \times 2 \text{ bytes} = 131 \text{ MB} \tag{15}

MosaicKV compressed KV per layer per head (including index storage, with 2-byte values and 2-byte indices):

Compressed K for all tokens (needed for the selection step itself): $256000 \times 38 \times 2 = 19.5$ MB
Selected K+V for attention: $25600 \times 38 \times 2 \times 2 = 3.9$ MB
Index arrays: $256000 \times 38 \times 2 = 19.5$ MB

Total per head: ~43 MB vs 131 MB original → 3.1× reduction (consistent with paper’s ~3×). Only the selected tokens go through full attention, but every token’s key must stay resident at the compressed channel count so the selection step can score it. This accounting counts values for the selected tokens only; keeping compressed values and their indices resident for every token would raise the total.

For the full model (32 heads, 32 layers): from ~134 GB to ~44 GB, which now fits with headroom for model weights (~16 GB) and activations on an H800.
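The worked example is short enough to check in code. The sketch below reproduces the per-head, per-layer accounting used above, including its assumption of 2-byte indices.

```python
T, d, r = 256_000, 128, 38
selected = T // 10                          # 10% token selection
BYTES = 2                                   # bf16 values; 2-byte indices assumed

original = 2 * T * d * BYTES                # K and V, one head, one layer
k_all = T * r * BYTES                       # compressed K for every token
kv_selected = 2 * selected * r * BYTES      # K and V of the selected tokens
indices = T * r * BYTES                     # index arrays
compressed = k_all + kv_selected + indices

print(original / 1e6, compressed / 1e6, round(original / compressed, 2))
# 131.072 42.8032 3.06
```

Multiplying both totals by 32 heads × 32 layers gives roughly 134 GB before and 44 GB after compression.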

Comparison with Prior Systems

Table 1: MosaicKV vs. Prior KV Cache Compression Methods

System	Compression Dims	Token Selection	Channel Compression	Accuracy Loss	Throughput Gain
Quest	Sequence only	Query-aware blocks	None	~5-15% at high ratio	~2×
ThinK	Channel (K only)	Optional	Global mask (K only)	~8% at 30% chan	~1.8×
InfiniGen	Channel for selection	SVD-accelerated	Full KV retained	~3%	~1.5×
ShadowKV	Partial 2D	SVD-enhanced	SVD-based	~5%	~3×
MosaicKV	Both (K and V)	Block-level	Per-vector, adaptive	1.76%	7.3×

Limitations and Boundary Conditions

SVD rotation cost: The initial SVD computation during prefill adds latency proportional to $O(T d^2)$ . For very long prefill sequences (>512K tokens), this could become significant. The paper does not measure or discuss prefill overhead.

CPU pipeline latency: The first few segments after the GPU starts decoding are processed only with the fast approximate compression, until the CPU catches up. If decodes are very short (few new tokens generated), the CPU may never finish before the request ends, and all tokens would use approximate compression throughout.

KV-sharing sensitivity: LLaMA-3.1-8B already uses grouped-query attention (8 KV heads shared by 32 query heads), so the method demonstrably runs under GQA, but the effect of sharing is not isolated. With multi-query attention (MQA) or larger GQA groups, each KV vector serves more diverse queries, and per-vector selection might be less effective.

Single GPU evaluation: Results are on a single H800 GPU. In multi-GPU serving with tensor parallelism (where each GPU holds a shard of heads), the overhead of SVD rotation and CPU buffering per GPU shard is not characterized.

Batch serving interaction: The token selection step must be done per-request (since different requests have different queries). In batched serving, this means running per-request selection over a shared KV cache — the overhead scaling with batch size is not discussed.

Critical Assessment: Weaknesses & Improvements

Weaknesses and Flaws

1. Overclaimed generality — single model tested

The headline results (16× speedup, 7.3× throughput, 1.76% accuracy loss) are presented as general claims, but the primary experimental model is LLaMA-3.1-8B. The paper mentions “multiple LLM models” but does not tabulate individual model results, making it impossible to assess how much variance exists across architectures. A model with different head dimension, number of heads, or attention pattern (e.g., a model using GQA or MLA) could behave very differently.

Models with smaller head dimensions (e.g., $d = 64$ ) have less “room” for channel compression — retaining 30% of channels leaves only 19 channels per head. The sensitivity of accuracy to this is not characterized.

2. Baseline incompleteness — no evaluation against ShadowKV’s 2D partial approach

The paper criticizes InfiniGen (uses channel compression only for selection, not in the KV cache itself) and ThinK (applies channel compression to K only) for incomplete 2D coverage. But ShadowKV (Sun et al., 2025) is a stronger baseline that already achieves partial 2D compression with SVD and was published before this work. The comparison to ShadowKV is not shown directly in the throughput/accuracy tradeoff table — only in passing.

3. Missing ablation: per-vector selection overhead

Per-vector element selection requires an argtopk operation for every new KV vector at every decode step. For a batch of 32 requests each generating 1K tokens, that is 32 × 1K = 32K selections per head per layer (twice that if K and V are selected separately), each scanning 128 elements, or about 4M elements scanned before multiplying by heads and layers. The paper does not report the overhead of this operation, nor compare it to the savings from compression. If the argtopk overhead is non-trivial at small context lengths, where attention is not yet the bottleneck, MosaicKV might hurt latency in that regime.
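
As a rough illustration of the bookkeeping in this weakness, the sketch below performs per-vector top-$r$ selection and counts the selections implied by a decode workload. It assumes absolute magnitude as the importance score, which may not match the scoring MosaicKV actually uses; `select_top_r` and `selection_workload` are hypothetical helper names, not functions from the paper.

```python
import heapq


def select_top_r(vector, r):
    """Return the sorted indices of the r largest-magnitude elements.

    Magnitude is an assumed importance score, used only for illustration.
    """
    top = heapq.nlargest(r, range(len(vector)), key=lambda i: abs(vector[i]))
    return sorted(top)


def selection_workload(batch, new_tokens, head_dim, heads=1, layers=1, kv=2):
    """Count top-r selections and elements scanned for newly decoded tokens."""
    vectors = batch * new_tokens * heads * layers * kv
    return {"selections": vectors, "elements_scanned": vectors * head_dim}


if __name__ == "__main__":
    # Toy vector: keep the 3 largest-magnitude channels.
    print(select_top_r([0.1, -2.0, 0.3, 1.5, -0.05, 0.9], r=3))
    # Per head, per layer, one tensor only: the setting used in the text.
    print(selection_workload(batch=32, new_tokens=1024, head_dim=128, kv=1))
    # Hypothetical full model: 32 heads, 32 layers, both K and V.
    print(selection_workload(32, 1024, 128, heads=32, layers=32))
```

With the numbers from the text, the per-head, per-layer case comes to 32,768 selections over 4,194,304 elements; the last call shows how quickly the count grows once heads, layers, and both K and V are included.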

4. Cherry-picked evaluation context range

The 16× speedup is the maximum achieved, presumably at very long contexts (likely 1M tokens). At shorter contexts (e.g., 32K tokens), where LLM deployments commonly operate, the benefit of the 2D approach over simpler sequence-only compression is less clear. The paper does not provide a context-length sensitivity curve, making it hard to judge where MosaicKV becomes worthwhile.

5. Accuracy metric granularity insufficient

Reporting a single 1.76% average accuracy loss across LongBench + RULER obscures task-level variation. Some LongBench subtasks (e.g., passage retrieval, multi-doc QA) are known to be highly sensitive to KV cache compression while others (summarization, code completion) are more tolerant. MosaicKV’s per-task accuracy breakdown is not provided, making it impossible to judge whether 1.76% is uniformly distributed or masks catastrophic degradation on specific tasks.

Limitations the Authors Understate

CPU bandwidth and memory costs not discussed: The CPU-side buffer requires copying raw KV vectors from GPU HBM to CPU RAM over PCIe. At 256K tokens, the raw fp16 KV cache is ~134 GB (2 × 32 layers × 32 heads × 128 × 2 bytes ≈ 0.5 MB per token, the full-model figure used earlier), and even at the ~64 GB/s peak one-way bandwidth of a PCIe 5.0 x16 link it would take ~2 seconds to transfer. Clearly the system does not transfer the full cache; only newly generated KV vectors are sent to the CPU. But the paper does not quantify PCIe utilization or whether the CPU buffer ever becomes a bottleneck at high token generation rates.
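
Since only newly generated KV vectors cross the link, the relevant quantity is the streaming rate rather than the full-cache size. The following back-of-envelope sketch estimates it, assuming fp16 storage and a 32-layer model with 32 KV heads of dimension 128 (the sizing this review uses). The decode rates and link bandwidths are illustrative assumptions, not measurements from the paper, and the function names are hypothetical.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """Raw K and V bytes produced by one decoded token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def pcie_utilization(tokens_per_second, bytes_per_token, link_gb_per_s):
    """Fraction of one-way link bandwidth used by streaming new KV vectors."""
    return tokens_per_second * bytes_per_token / (link_gb_per_s * 1e9)


if __name__ == "__main__":
    per_token = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128)
    print(f"{per_token / 2**20:.2f} MiB of raw KV per decoded token")
    for rate in (100, 1_000, 10_000):  # assumed aggregate decoded tokens/s
        for link in (32, 64):  # assumed one-way link bandwidth in GB/s
            util = pcie_utilization(rate, per_token, link)
            print(f"{rate:>6} tok/s over {link} GB/s: {util:.2%} of the link")
```

Under these assumptions each decoded token produces 0.5 MiB of raw KV, so the share of the link grows linearly with aggregate decode throughput. A measured utilization figure from the authors would settle whether this matters in practice.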

Rotation matrix staleness: The SVD rotation matrices $R_k, R_v$ are computed from the prefill KV cache. As decoding proceeds and the KV cache grows, the optimal rotation directions may shift. The paper does not address whether the rotation matrices are ever recomputed, or how much accuracy is lost over very long generation sequences (e.g., 10K generated tokens after a 1M-token prompt) due to rotation staleness.

Memory overhead of index storage: Storing index arrays $\mathcal{I}_i$ per token adds overhead. For $n$ tokens each with $r$ selected elements, storing 2-byte indices requires $n \times r \times 2$ bytes. At 10% of 1M tokens ($n = 100K$) with $r = 38$ (30% of $d = 128$), this is 100K × 38 × 2 = 7.6 MB per layer per head, which over 32 layers and 32 heads comes to about 7.8 GB just for indices (and twice that if K and V each carry their own index arrays). This “index tax” is not reported as part of the memory savings comparison.
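
The index-tax arithmetic can be reproduced, and varied by index width, with a few lines of Python. This is a sketch under the same assumptions as the paragraph above (100K retained tokens, $r = 38$, $d = 128$, 32 layers, 32 heads); `index_bytes` is a hypothetical helper, and the 1-byte and 7-bit rows are alternatives shown for comparison, not formats the paper describes.

```python
def index_bytes(tokens, kept, head_dim, layers, heads, bits_per_index=None):
    """Bytes needed to store kept-channel indices for one tensor (K or V).

    If bits_per_index is None, use the narrowest width that can address
    every channel in [0, head_dim).
    """
    if bits_per_index is None:
        bits_per_index = max(1, (head_dim - 1).bit_length())
    total_bits = tokens * kept * bits_per_index * layers * heads
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    cfg = dict(tokens=100_000, kept=38, head_dim=128, layers=32, heads=32)
    for label, bits in (("2-byte", 16), ("1-byte", 8), ("7-bit packed", None)):
        size = index_bytes(bits_per_index=bits, **cfg)
        print(f"{label:>12}: {size / 1e9:.2f} GB")
```

Note that with 2-byte indices and fp16 values, each retained element costs as many bytes for its index as for its value, so under that assumption the indices would be as large as the compressed payload they describe.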

Concrete Improvement Suggestions

1. Full model sweep with GQA, MQA, and MLA: Test MosaicKV on Mistral-7B (uses GQA), Falcon-7B (uses MQA), and a DeepSeek model (uses MLA with latent-vector KV) to characterize how architectural variants affect the per-vector selection benefit. This would clarify whether the method is architecture-neutral or needs adaptation.

2. Cost breakdown by context length: Provide a latency breakdown at context lengths of 8K, 32K, 64K, 128K, 256K, 1M to show exactly where MosaicKV’s crossover point is. Users deploying at 32K contexts need to know if the overhead costs outweigh the benefits at that scale.

3. Index compression: The index arrays $\mathcal{I}_i$ consume significant memory. Since indices lie in $[0, d)$ (e.g., [0, 127]), they need only 7 bits each. Storing them as single bytes instead of 2-byte integers would halve the index storage, bit-packing to 7 bits would save slightly more, and delta encoding from the per-segment base indices could reduce it further.

4. Online rotation update: Maintain running statistics of the KV cache (e.g., exponential moving average of $K^\top K$ ) and periodically recompute the rotation matrix $R_k$ during decode. This would address rotation staleness for very long generation sequences.

5. Per-task accuracy breakdown: Publish per-task LongBench results to enable users to understand which applications benefit most and which might see unacceptable quality degradation.

6. Batched serving experiment: Evaluate MosaicKV under realistic batch serving conditions (batch size 16–64, mixed context lengths, prefix sharing across requests) to assess whether the per-request token selection creates scheduling complications in production serving systems.
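
Suggestion 4 above can be sketched in a few lines. The example below keeps an exponential moving average of the key outer products and uses power iteration to track only the dominant direction, flagging the rotation as stale when that direction drifts away from the one seen at prefill. It is a simplified illustration on synthetic data, not the paper's method: a real implementation would recompute the full rotation matrix (for example with an eigendecomposition from a linear-algebra library), and the decay, threshold, and function names are assumptions.

```python
import math


def ema_gram_update(gram, key, decay=0.99):
    """Blend the outer product key * key^T into a running d x d Gram matrix."""
    d = len(key)
    return [
        [decay * gram[i][j] + (1.0 - decay) * key[i] * key[j] for j in range(d)]
        for i in range(d)
    ]


def dominant_direction(gram, iters=200):
    """Power iteration: unit eigenvector for the largest eigenvalue of gram."""
    d = len(gram)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(gram[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def alignment(u, v):
    """Absolute cosine between two unit vectors (1.0 means the same axis)."""
    return abs(sum(a * b for a, b in zip(u, v)))


if __name__ == "__main__":
    d = 4
    # Synthetic prefill statistics dominated by channel 0.
    gram = [[0.0] * d for _ in range(d)]
    gram[0][0], gram[1][1] = 1.0, 0.1
    prefill_dir = dominant_direction(gram)

    # Synthetic decode-time keys that drift toward channel 1.
    for _ in range(500):
        gram = ema_gram_update(gram, [0.1, 1.0, 0.0, 0.0], decay=0.99)

    current_dir = dominant_direction(gram)
    stale = alignment(prefill_dir, current_dir) < 0.9  # assumed threshold
    print("recompute rotation:", stale)
```

The EMA update costs $O(d^2)$ per new key, so the running statistics are cheap relative to the recomputation itself, which could be scheduled only when the staleness check fires.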

Conclusion

MosaicKV makes a well-motivated and technically sound contribution to the long-context KV cache compression problem. Its central insight — that per-vector element selection dramatically outperforms global channel masking for combined 2D compression — is both intuitive and empirically validated.

The engineering design is equally thoughtful: the heterogeneous CPU/GPU buffering elegantly exploits the massive underutilization of non-HBM resources during decode-stage attention, turning a potential performance liability (expensive SVD computation) into a background job that stays off the critical decode path.

The results are impressive: a 7.3× throughput gain and a 16× attention speedup at 1.76% accuracy loss represent a substantial advance over prior single-dimension methods. If the accuracy results hold up under deeper per-task analysis and the method scales to other architectures, this is a compelling system for production deployment of long-context LLMs.

The main open questions are around generalization to diverse architectures, production batch serving behavior, and the hidden costs of index storage and CPU-GPU bandwidth — areas where follow-up work is needed.

References

Quest: Tang et al., 2024. “Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference.”
ThinK: Xu et al., 2025. “ThinK: Thinner Key Cache by Query-Driven Pruning.”
ShadowKV: Sun et al., 2025. “ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference.”
InfiniGen: Lee et al., 2024. “InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management.”
H2O: Zhang et al., 2023. “H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.”
FlashAttention: Dao et al., 2022. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.”
FlashInfer: Ye et al., 2025. “FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving.”
LongBench: Bai et al., 2024. “LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding.”
RULER: Hsieh et al., 2024. “RULER: What’s the Real Context Size of Your Long-Context Language Models?”