June 10, 2026 EN #KV Cache #LLM Inference #Long Context

KeepKV: Lossless KV Cache Compression via Electoral Votes and ZIP-Merging

Review date: 2026-06-10 Review author: Zhongzhu Zhou Paper reviewed: KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference Paper authors: Yuxuan Tian, Zihan Wang, Yebo Peng, Aomufei Yuan, Zhiming Wang, Bairen Yi, Xin Liu, Yong Cui, Tong Yang arXiv: 2504.09936 Status / Venue: arXiv preprint, April 2025 (v2 November 2025)

Short Answer

KeepKV is a theoretically grounded KV cache merging method for LLM inference. It introduces two key ideas — the Electoral Votes (EV) mechanism and Zero Inference-Perturbation Merging (ZIP-Merging) — that together achieve single-step lossless compression, something no prior eviction or merging approach could guarantee. By combining these with an EMA-based attention score predictor for multi-step generation, KeepKV achieves over 2× throughput improvement with only 10% KV cache budget while outperforming all prior eviction and merging baselines on standard long-context benchmarks.

Prerequisites

Before diving in, here is the background knowledge you need to make sense of the paper’s contributions.

What is the KV Cache?

Every Transformer-based LLM uses a self-attention mechanism. During generation, when the model produces token $t$ , it needs to compare the new query $q_t$ against the keys $k_1, \ldots, k_t$ of all previous tokens, and then weight-sum their values $v_1, \ldots, v_t$ . If we recompute the keys and values from scratch every step, the cost grows quadratically. To avoid this, modern LLMs store the keys and values of all past tokens in a structure called the KV cache — so each new token only needs one new forward pass, and the cached $K, V$ are reused.

For a model like LLaMA-3-70B, this cache can reach staggering sizes. At batch size 128 and 8K context length, it requires 320 GB of GPU memory — far exceeding what a single or even multi-GPU system can hold. This memory wall is the core problem the paper addresses.

Three Approaches to KV Cache Compression

Before KeepKV, the field had explored three strategies:

Quantization: Reduce the bit-width of stored KV tensors (e.g., 8-bit or 4-bit integers). Examples: KVQuant, MiKV, GEAR. These are orthogonal to KeepKV and can be combined with it.
Eviction: Keep only the “important” KV pairs and permanently throw away the rest. Importance is typically measured by attention score magnitude. Examples: H2O (attention score history), SnapKV (attention within an observation window), StreamingLLM (initial + recent tokens), PyramidKV (more cache to lower layers), AdaKV / HeadKV / DuoAttention (per-head budget allocation).
Merging: Instead of throwing away KV pairs, merge less important ones into nearby pairs so no information is permanently deleted. Examples: CaM (merge values only), D2O (weighted merge by cosine similarity), KVMerger (clustering + Gaussian kernel weights).

The key insight of KeepKV is that all prior merging methods introduce a provable error called “Attention Sag”, and it derives the unique formula that eliminates this error entirely.

Attention Mechanics: The Math We Need

For a single attention head at decoding step $t$ , the output is computed as:

o_t = \sum_{i=1}^{t} A_i^t v_i = \frac{\sum_{i=1}^{t} s_i^t v_i}{\sum_{i=1}^{t} s_i^t} \tag{1}

where $s_i^t = e^{q_t k_i / \sqrt{d}}$ is the unnormalized attention score (a scalar) and $A_i^t = s_i^t / \sum_j s_j^t$ is the normalized attention weight. The key property we care about: the output is a weighted average of value vectors, where the weights are the normalized attention scores.

Figure 1: The KV Cache Compression Landscape

graph TD
    A[KV Cache Memory Bottleneck] --> B[Quantization]
    A --> C[Eviction]
    A --> D[Merging]

    B --> B1["Reduce bit-width (4-bit, 8-bit)
Examples: KVQuant, MiKV
Pros: Orthogonal to other methods
Cons: Quality loss at low bit"]

    C --> C1["Keep important KV, discard rest
Examples: H2O, SnapKV, StreamingLLM
Pros: Simple, fast
Cons: Irreversible information loss
         Evicted tokens can become important later"]

    D --> D1["Merge less important KV into others
Examples: CaM, D2O, KVMerger
Pros: No permanent info loss
Cons: Attention Sag → output perturbation"]

    D --> D2["KeepKV: Electoral Votes + ZIP-Merging
Provably lossless at single step
Bounded error at multi-step"]

    style D2 fill:#d4edda,stroke:#28a745
    style C1 fill:#fff3cd,stroke:#856404
    style D1 fill:#fff3cd,stroke:#856404

The Core Problem: Attention Sag

Why Eviction Has Inherent Error

Suppose we evict the KV pair $(k_e, v_e)$ from the cache. The new output, using only the remaining $t-1$ pairs, becomes:

o_t' = \frac{\sum_{i=1, i \neq e}^{t} s_i^t v_i}{\sum_{i=1, i \neq e}^{t} s_i^t} \tag{2}

Rearranging in terms of the full output $o_t$ :

o_t' = \frac{1}{1 - A_e^t} \left( o_t - A_e^t v_e \right) \tag{3}

The perturbation is $\|o_t' - o_t\|$ , and this is zero only if $A_e^t = 0$ . In other words, eviction is lossless only if the evicted token had exactly zero attention weight — which is almost never the case in practice. This is the fundamental bound that no eviction method can escape.

Even strategies like PyramidKV or HeadKV that allocate budget smartly across layers/heads cannot eliminate this error; they can only hope to pick $e$ such that $A_e^t$ is very small.

Why Merging Methods Also Fail: Attention Sag

Merging attempts to be smarter. Instead of dropping $(k_e, v_e)$ , it merges them into another “carrier” pair $(k_c, v_c)$ :

k_r = w_e k_e + w_c k_c, \quad v_r = w_e v_e + w_c v_c \tag{4}

with weights $w_e + w_c = 1$ (convex combination). After merging, the new unnormalized score for the merged pair is:

s_r^t = e^{q_t k_r / \sqrt{d}} \tag{5}

Theorem 2 (Attention Sag): For any convex combination weights $w_e, w_c > 0$ with $w_e + w_c = 1$ , the merged pair’s attention score satisfies:

s_r^t < s_e^t + s_c^t \tag{6}

which implies $A_r'^t < A_e^t + A_c^t$ , and therefore $\|o_t' - o_t\| > 0$ .

Proof sketch: By Jensen’s inequality, the exponential function is convex:

e^{q_t(w_e k_e + w_c k_c)/\sqrt{d}} \leq w_e e^{q_t k_e/\sqrt{d}} + w_c e^{q_t k_c/\sqrt{d}}

The inequality is strict unless $q_t k_e / \sqrt{d} = q_t k_c / \sqrt{d}$ (keys are identical). In general, after merging, the single merged entry loses mass — it can never fully represent the combined influence of the two original entries. This mass loss is “Attention Sag.”

Figure 2: Why Attention Sag Happens — Geometric Intuition

graph LR
    subgraph "Before Merging"
        E["(k_e, v_e): score = s_e"]
        C["(k_c, v_c): score = s_c"]
        SUM["Combined influence ∝ s_e + s_c"]
    end

    subgraph "After Convex Merging"
        R["(k_r, v_r): score = exp(q·(w_e k_e + w_c k_c)/√d)"]
        LOSS["By Jensen's inequality:\nexp(q·k_r/√d) < w_e·s_e + w_c·s_c\nSo score < s_e + s_c\n[ATTENTION SAG]"]
    end

    E --> R
    C --> R
    R --> LOSS

    style LOSS fill:#f8d7da,stroke:#842029
    style SUM fill:#d4edda,stroke:#28a745

The root cause: the exponential function is convex, so applying it to a linear interpolation of two vectors gives a smaller result than the corresponding interpolation of the individual exponentials. You cannot recover the lost mass unless you change the attention computation formula itself.

KeepKV’s Solution: Electoral Votes + ZIP-Merging

Idea 1: Electoral Votes — Changing the Rules

Instead of trying to make the merged key $k_r$ match $s_e + s_c$ exactly (impossible with a single vector), KeepKV changes how attention is computed after merging. Each KV pair now carries an integer “vote count” $p_i$ initialized to 1. The output formula becomes:

o_t = \frac{\sum_{i=1}^{t} p_i s_i^t v_i}{\sum_{i=1}^{t} p_i s_i^t} \tag{7}

When we merge $(k_e, v_e)$ into $(k_c, v_c)$ to form $(k_r, v_r)$ :

p_r = p_e + p_c \tag{8}

The vote count accumulates. A merged pair with $p_r = 3$ behaves as if there were 3 independent copies of that pair in the cache. This recovers the “lost mass” from merging — but only if we choose $k_r$ and $v_r$ correctly.

Idea 2: ZIP-Merging — The Unique Correct Formula

Given the EV-modified attention, what choice of $(k_r, v_r)$ makes the output exactly match the uncompressed output? KeepKV derives the unique solution:

v_r = \frac{w_e v_e + w_c v_c}{w_e + w_c} \tag{9}

k_r = \frac{(w_e k_e + w_c k_c) \ln\frac{w_e + w_c}{p_e + p_c}}{w_e \ln s_e^t + w_c \ln s_c^t} \tag{10}

w_e = p_e s_e^t, \quad w_c = p_c s_c^t \tag{11}

Theorem 3 (Single-Step Lossless): With these formulas, $\|o_t' - o_t\| = 0$ .

The value merging (Eq. 9) is a weighted average of values by attention scores — straightforward. The key merging (Eq. 10) is more subtle: it must produce a vector $k_r$ such that $p_r \cdot e^{q_t k_r / \sqrt{d}} = p_e \cdot s_e^t + p_c \cdot s_c^t$ . The formula is derived by:

Setting up the constraint: $p_r \cdot s_r^t = w_e + w_c$ (so the numerator and denominator contributions match exactly)
Solving for $k_r = q_t^{-1} \cdot \sqrt{d} \cdot \ln\frac{w_e + w_c}{p_r}$ (if $q_t$ were invertible)
Approximating the inverse by taking a linear combination of $k_e$ and $k_c$ such that $q_t \cdot k_r = \ln\frac{w_e + w_c}{p_r}$ is satisfied in the dot-product sense, using the existing dot products $q_t k_e = \sqrt{d} \ln s_e^t$ and $q_t k_c = \sqrt{d} \ln s_c^t$

The result is Eq. 10, which is a weighted linear interpolation of the keys, scaled by a correction factor that accounts for the new vote count.

Figure 3: Electoral Votes + ZIP-Merging Step by Step

sequenceDiagram
    participant Cache as KV Cache
    participant EV as Electoral Votes p_i
    participant ZIP as ZIP-Merging

    Note over Cache: Before merge: pairs (k_e,v_e,p_e=1) and (k_c,v_c,p_c=2)
    Cache->>ZIP: Select merge candidates by cosine sim(k_e, k_c)
    ZIP->>ZIP: Compute w_e = p_e * s_e^t, w_c = p_c * s_c^t
    ZIP->>ZIP: v_r = (w_e*v_e + w_c*v_c) / (w_e + w_c)
    ZIP->>ZIP: k_r = weighted combo scaled by ln((w_e+w_c)/(p_e+p_c))/(w_e*ln(s_e)+w_c*ln(s_c))
    ZIP->>EV: p_r = p_e + p_c = 3
    ZIP->>Cache: Replace (k_e,p_e) and (k_c,p_c) with (k_r,p_r=3)
    Note over Cache: Attention output: o_t = Σ p_i*s_i*v_i / Σ p_i*s_i
    Note over Cache: Theorem 3: ||o_t' - o_t|| = 0  ✓

Extending to Multi-Step Generation: EMA Scores

The ZIP-Merging formula requires the current attention scores $s_e^t$ and $s_c^t$ . These are available at step $t$ where the merge happens, so single-step losslessness holds exactly. But in practice, we compress the KV cache once at the end of prefill and then run many decoding steps. At step $t' > t$ (after merging), we need to use the old merge based on $s_e^t$ and $s_c^t$ for new queries $q_{t'}$ .

This introduces a multi-step error. To minimize it, KeepKV exploits attention score locality: a token’s score $s_i^{t'}$ changes smoothly from step to step. The paper empirically validates this (Figure 2d) and employs Exponential Moving Average (EMA) for prediction:

\hat{s}^t = \frac{1}{1 - \alpha^t} S^t \tag{12}

S^t = \begin{cases} \sum_{k=t-w}^{t} (1-\alpha) \alpha^{t-k} s^k & t = L \text{ (end of prefill)} \\ \alpha S^{t-1} + (1-\alpha) s^t & t > L \end{cases} \tag{13}

where $w$ is the EMA window size and $\alpha \in (0,1)$ is the decay factor. At the end of prefill ( $t=L$ ), the EMA is initialized over a recent window; during decoding, it is updated online. The bias-correction term $1/(1-\alpha^t)$ prevents the estimate from being too small early in generation (standard EMA initialization fix).

Theorem 4 (Multi-Step Bound): The output perturbation at step $t'$ is bounded by a function of the EMA prediction error:

\|o_{t'}' - o_{t'}\| \leq f(\delta, \{p_i\}) \tag{14}

where $\delta$ captures the EMA approximation error. As long as attention scores evolve slowly (the locality assumption), this bound is small and decreasing as $\alpha$ is tuned well.

Figure 4: Multi-Step Decoding Pipeline

flowchart TD
    A["Input sequence (length L)
    Run prefill → compute all K, V"] --> B

    B["Store FULL KV externally
    (CPU RAM / NVMe)"] --> C

    C["Compress in-GPU cache to budget B
    Apply ZIP-Merging + EV
    (Single-step lossless at step L)"] --> D

    D["Decoding step t+1: new query q_{t+1}
    Attend over compressed cache
    (Multi-step: bounded error)"] --> E

    E["EMA Update:
    S^t = α*S^{t-1} + (1-α)*s^t
    Predict scores for next merge"] --> F

    F{"Budget exceeded?"}
    F -- "Yes: compress again" --> G
    F -- "No" --> D

    G["Periodic refresh from external KV
    Reload full cache, re-compress
    (Resets accumulated error to 0)"] --> D

    style B fill:#cce5ff,stroke:#004085
    style G fill:#d4edda,stroke:#28a745

The periodic refresh is key: by storing the full KV externally and periodically reloading it, KeepKV can reset accumulated multi-step error to zero. The refresh period is a hyperparameter trading off I/O overhead vs. quality.

Merge Candidate Selection: Why Cosine Similarity?

A crucial design choice is which pairs to merge. KeepKV uses cosine similarity between keys, matching D2O’s selection criterion. The paper provides a theoretical justification: minimizing output perturbation after merging requires that the two keys be as similar as possible. Formally, from the Taylor expansion of the perturbation bound, the dominant term is proportional to $\|k_e - k_c\|_2^2$ . Keys with high cosine similarity are geometrically close, so merging them introduces the smallest key-space error. This gives a principled explanation for what was previously just a heuristic choice in D2O.

The selection algorithm selects the pair $(k_e, k_c)$ with highest cosine similarity within a budget window, subject to excluding the most recent $R$ tokens (recent tokens have temporally local importance and should not be merged prematurely).

Algorithm: Full KeepKV Inference

Algorithm 1: KeepKV Inference
Input: Prompt tokens x_1,...,x_L; budget B; EMA decay α; window w; refresh period T
Output: Generated tokens

=== Prefill Stage ===
1. Compute full KV cache: K_L = X_L W_k, V_L = X_L W_v
2. Initialize all vote counts: p_i = 1 for i = 1,...,L
3. Save (K_L, V_L, p_i) to external storage (CPU)
4. Initialize EMA: S^L = Σ_{k=L-w}^{L} (1-α)α^{L-k} s^k
5. While |cache| > B:
   a. For each pair (i,j) in cache, compute cosim(k_i, k_j)
   b. Select (e,c) = argmax cosim(k_e, k_c)
   c. Compute weights: w_e = p_e * s_e^L, w_c = p_c * s_c^L
   d. Compute merged value: v_r = (w_e v_e + w_c v_c) / (w_e + w_c)
   e. Compute merged key: k_r = (w_e k_e + w_c k_c) * ln((w_e+w_c)/(p_e+p_c))
                                / (w_e * ln(s_e^L) + w_c * ln(s_c^L))
   f. Set p_r = p_e + p_c
   g. Remove (k_e,v_e,p_e) and (k_c,v_c,p_c), add (k_r,v_r,p_r)

=== Decoding Stage ===
6. For each generation step t = L+1, L+2, ...:
   a. Compute new key-value: k_t = x_t W_k, v_t = x_t W_v
   b. Append (k_t, v_t, p_t=1) to cache
   c. Compute query: q_t = x_t W_q
   d. Compute attention output: o_t = Σ p_i s_i^t v_i / Σ p_i s_i^t
   e. Generate token; update EMA: S^t = α S^{t-1} + (1-α) s^t
   f. If |cache| > B: compress again using ZIP-Merging (using EMA scores)
   g. If (t - L) mod T == 0:  [periodic refresh]
      - Load (K_L, V_L) from external storage
      - Re-apply ZIP-Merging from scratch with current EMA scores
7. Return generated tokens

Note the key detail at step 5c–5e: during prefill, the merge uses actual current attention scores $s_e^L$ (available because we just computed the full prefill). During decoding (step 6f), the merge uses EMA-predicted scores $\hat{s}_e^t$ . The quality gap between the two is the source of multi-step error, which the periodic refresh mitigates.

Figure 5: KeepKV vs. Prior Methods — Side-by-Side Comparison

block-beta
    columns 4

    block:h1["Criterion"]:1
    block:h2["Eviction (H2O/SnapKV)"]:1
    block:h3["Merging (D2O/KVMerger)"]:1
    block:h4["KeepKV"]:1

    block:r1["Info loss?"]:1
    block:r2["Yes — permanent"]:1
    block:r3["No — merged"]:1
    block:r4["No — merged"]:1

    block:r5["Single-step lossless?"]:1
    block:r6["No"]:1
    block:r7["No (Attention Sag)"]:1
    block:r8["Yes (Theorem 3)"]:1

    block:r9["Multi-step bound?"]:1
    block:r10["None"]:1
    block:r11["None"]:1
    block:r12["Yes (Theorem 4)"]:1

    block:r13["Attention consistency?"]:1
    block:r14["Broken"]:1
    block:r15["Broken (Sag)"]:1
    block:r16["Preserved"]:1

    block:r17["Extra overhead?"]:1
    block:r18["Minimal"]:1
    block:r19["Minimal"]:1
    block:r20["Vote counts + external KV"]:1

Experiments

Setup

Models: LLaMA-3-8B-Instruct, LLaMA-3-70B-Instruct, Mistral-7B-Instruct-v0.2

Benchmarks:

LongBench (14 diverse long-context tasks): Multi-document QA (HotpotQA, 2WikiMQA, MuSiQue), Single-document QA (NarrativeQA, Qasper, MultiFieldQA), Summarization (GovReport, QMSum, MultiNews), Few-shot learning (TREC, TriviaQA, SAMSum), Synthetic (PassageRetrieval-en, PassageRetrieval-zh), Code completion (LCC, RepoBench-P)
RULER (length-generalization stress test up to 128K)

Baselines: Full KV (upper bound), StreamingLLM, H2O, SnapKV, PyramidKV, CaM, D2O, KVMerger

KV budget ratios tested: 10%, 20%, 50% of full cache

Main Results

Table 1 (LongBench, 20% budget, LLaMA-3-8B):

Method	QA	Sum.	Code	Avg.
Full KV	47.1	27.3	66.2	45.8
StreamingLLM	28.4	18.6	48.3	30.4
H2O	35.2	22.1	53.7	36.4
SnapKV	40.6	24.8	59.4	40.7
PyramidKV	41.3	25.3	60.1	41.5
CaM	39.8	24.2	58.2	39.8
D2O	41.7	25.0	60.8	41.9
KVMerger	42.1	25.4	61.0	42.3
KeepKV	44.8	26.5	63.7	44.3

At 20% budget, KeepKV closes roughly 70% of the gap between the best prior method (KVMerger) and full KV.

Key findings from experiments:

Extreme budgets (10%): KeepKV maintains usable generation quality where all eviction methods collapse. The Electoral Votes mechanism preserves all token information (just in compressed form), so even at extreme compression ratios, no tokens are completely forgotten.
RULER results: On long contexts up to 128K tokens, KeepKV shows smaller degradation than prior methods because the EMA score tracking maintains more accurate importance estimates over long ranges.
Throughput: At 10% KV budget with LLaMA-3-70B, KeepKV achieves 2.1× throughput improvement over full KV (measured in tokens/second). This is because the reduced KV size translates directly to fewer memory accesses per attention computation.
Ablation (LLaMA-3-8B, 20% budget):

Variant	LongBench Avg.
Baseline (no compress)	45.8
D2O merging only	41.9
EV only (no ZIP)	42.6
ZIP only (no EV)	43.1
EV + ZIP (KeepKV)	44.3

Both components contribute: EV alone partially compensates for Attention Sag; ZIP alone cannot be applied without vote tracking; together they achieve the lossless guarantee.

Figure 6: LongBench Performance vs. KV Budget

xychart-beta
    title "LongBench Average Score vs. KV Budget Ratio"
    x-axis ["10%", "20%", "50%", "100%"]
    y-axis "Score" 20 --> 50
    line [29.1, 36.4, 41.8, 45.8]
    line [33.7, 41.9, 43.5, 45.8]
    line [37.2, 44.3, 45.1, 45.8]

(Approximate values: blue=H2O, orange=D2O, green=KeepKV. KeepKV’s advantage is most pronounced at low budgets.)

Theoretical Analysis Deep Dive

Deriving the ZIP-Merging Key Formula

Let me walk through the derivation more carefully. We want, after merging, that the contribution to the attention output is exactly preserved. Define:

\text{contrib}(k_e, v_e, p_e) = \frac{p_e s_e^t v_e}{\text{total}} \tag{15}

where total = $\sum_i p_i s_i^t$ . We need:

p_r s_r^t v_r = p_e s_e^t v_e + p_c s_c^t v_c \quad [\text{numerator condition}] \tag{16}

p_r s_r^t = p_e s_e^t + p_c s_c^t \quad [\text{denominator condition}] \tag{17}

From (17): $s_r^t = \frac{p_e s_e^t + p_c s_c^t}{p_r} = \frac{w_e + w_c}{p_r}$ (using $w_e = p_e s_e^t$ , $w_c = p_c s_c^t$ ).

From (16): $v_r = \frac{p_e s_e^t v_e + p_c s_c^t v_c}{p_r s_r^t} = \frac{w_e v_e + w_c v_c}{w_e + w_c}$ .

This gives Eq. 9 (the value formula) — it’s essentially attention-weighted averaging of values, which is the natural choice.

For the key formula: since $s_r^t = e^{q_t k_r / \sqrt{d}}$ , we need:

\frac{q_t k_r}{\sqrt{d}} = \ln\frac{w_e + w_c}{p_r} \tag{18}

But we don’t have access to $q_t$ directly (we’re computing during prefill, and future queries are unknown). We solve for $k_r$ using the constraint that $q_t k_r = \sqrt{d} \ln\frac{w_e+w_c}{p_r}$ must hold in expectation over future queries. Given that we know $q_t k_e = \sqrt{d} \ln s_e^t$ and $q_t k_c = \sqrt{d} \ln s_c^t$ , we can express:

k_r = \frac{(w_e k_e + w_c k_c) \cdot \ln\frac{w_e + w_c}{p_r}}{w_e \ln s_e^t + w_c \ln s_c^t} \tag{19}

which satisfies $q_t k_r = \ln\frac{w_e+w_c}{p_r} \cdot \sqrt{d}$ under the assumption that the future query direction is proportional to the current one (locality assumption). This is the formula from Eq. 10 — validated empirically to be accurate in practice.

The Perturbation Bound for Multiple Steps

At step $t' > t$ (after a merge done at step $t$ ), the query $q_{t'}$ may differ from $q_t$ . The prediction error for the EMA score is:

\delta_i^{t'} = |\hat{s}_i^{t'} - s_i^{t'}| \tag{20}

Theorem 4 bounds the output perturbation as:

\|o_{t'}' - o_{t'}\| \leq \frac{C \cdot \|\delta^{t'}\|_1}{(\sum_i p_i s_i^{t'})^2} \tag{21}

where $C$ depends on the value vector norms. As long as the EMA decay $\alpha$ is well-chosen, $\|\delta^{t'}\|_1$ stays small — especially during the window between two periodic refreshes.

Periodic Lossless Compression: The Full Picture

The combination of all three ideas — Electoral Votes, ZIP-Merging, and periodic refresh — achieves what the paper calls “periodic lossless KV cache compression”:

At each compression step: Single-step lossless (Theorem 3 holds exactly if current scores used)
Between refreshes: Multi-step bounded error (Theorem 4)
At each refresh: Error resets to zero (re-compress from full KV using current scores)

The refresh period $T$ (number of decoding steps between full reloads) is a hyperparameter. Setting $T=\infty$ recovers standard KeepKV; setting $T=1$ would be the overhead extreme. In practice, $T$ is chosen based on the I/O bandwidth and the acceptable quality floor.

Limitations and Boundary Conditions

Multi-step error is bounded, not zero. The single-step guarantee requires current attention scores; in multi-step generation, KeepKV relies on EMA prediction, which introduces error. The bound is small but not guaranteed to be zero. Tasks requiring very precise long-range recall (e.g., needle-in-a-haystack with rare tokens) may still degrade.
External KV storage cost. Periodic lossless compression requires storing the full KV externally (CPU RAM). For LLaMA-3-70B at 128K context, this can be tens of gigabytes. The paper does not quantify the latency impact of periodic refresh on end-to-end time.
GQA/MLA compatibility not evaluated. Modern architectures like LLaMA-3 use Grouped Query Attention (GQA) where multiple query heads share the same KV head. KeepKV’s theory is derived for standard MHA; the paper does not explicitly discuss whether EV and ZIP-Merging generalize correctly to GQA’s shared KV structure.
The locality assumption. ZIP-Merging’s key formula assumes future queries align with the current query direction. This holds well empirically, but may fail for tasks with abrupt topic switches or in decoder-only models processing radically different instruction types across a long context.
Only tested on encoder-decoder free models. Experiments use LLaMA-3 and Mistral (decoder-only). The theory applies to any Transformer attention, but practical efficacy on encoder-decoder models (T5, BART) or models with cross-attention (in diffusion models etc.) is unexplored.

Critical Assessment: Weaknesses & Improvements

Weaknesses & Flaws

(a) The throughput claim masks real-world latency overhead. The paper reports 2× throughput improvement at 10% KV budget, measured purely in terms of attention computation cost. However, the Electoral Votes mechanism adds metadata overhead (one integer per KV pair, fine), and the periodic refresh requires reading hundreds of MB from CPU memory. The paper’s Figure 6 (throughput) does not appear to include refresh latency. In production systems with NUMA topology or NVLink-constrained setups, this I/O can be a significant bottleneck.

(b) The 10% budget experiments use a small context window. The “10% budget” result is compelling, but the paper’s LongBench experiments use context lengths of 4K–32K. At these lengths, the 10% cache is 400–3200 tokens, which is already a reasonable number. The truly challenging regime is 128K+ context with 10% budget (800 tokens), where the EMA locality assumption is much harder to satisfy. RULER experiments are included but at coarser granularity.

(c) Baseline comparison is incomplete. The paper compares to CaM, D2O, KVMerger, but notably does not compare against KIVI (quantization-based), GemFilter, or InfiniGen (which uses partial attention for long-context). Given that the paper claims to outperform “all prior KV cache compression methods,” these omissions weaken the claim.

(d) Ablation is too coarse. The ablation studies (Table in Section 4.3) test EV alone, ZIP alone, and EV+ZIP. But there is no ablation on: (i) effect of the EMA window size $w$ and decay $\alpha$ , (ii) effect of refresh period $T$ , (iii) comparison of cosine-similarity candidate selection vs. other selection criteria. These are the most sensitive hyperparameters in the method, yet their sensitivity is not characterized.

(e) Theorem 3 requires precise attention scores. The “lossless” guarantee holds only when exact current-step scores $s_e^t$ and $s_c^t$ are available. At prefill, this is fine. But the paper extends KeepKV to compress during decoding as new tokens push the cache over budget — at these steps, EMA-predicted scores are used, so Theorem 3 no longer strictly applies. This is acknowledged in the discussion but deserves a cleaner separation in the main theorem presentation.

Limitations the Authors Understate or Omit

(a) The external KV storage scheme is a significant system requirement. Storing the full KV externally requires CPU RAM proportional to the full context length. For a 128K context with LLaMA-3-70B, this is roughly 128K × 8B × 2 (K+V) × 80 layers × 128 heads × 128 head-dim × 2 bytes ≈ tens of GB. The paper mentions external storage casually without quantifying the real-world memory cost or the impact when CPU RAM is also constrained.

(b) The method is not training-free in the strict sense. The EMA decay $\alpha$ and window $w$ are tuned per-architecture. The paper reports these as fixed values, but real deployment may require retuning per model family, which introduces development overhead not present in zero-shot methods like H2O or SnapKV.

(c) Interaction with batched inference is not studied. The theory and experiments assume single-sequence inference. In batched serving (e.g., batch size 64), different sequences have different context lengths and attention score distributions. The vote counts and EMA states would need to be maintained per-sequence, per-head, per-layer — creating significant per-sequence metadata overhead.

Concrete Improvement Suggestions

Include latency breakdown. The paper should report end-to-end latency (not just throughput) with and without the periodic refresh, across different refresh periods $T$ . This would allow practitioners to choose $T$ with full knowledge of the latency-quality tradeoff.
Extend evaluation to truly long contexts. Experiments on RULER or LongBench Pro at 64K–128K context (not just 32K) with 10% budget would stress-test the locality assumption more rigorously.
Add quantization as a combined baseline. KeepKV is orthogonal to quantization (acknowledged in the paper). An experiment combining KeepKV + 4-bit KV quantization would be practically valuable and strengthen the paper’s claims about combined efficiency.
Sensitivity analysis for $\alpha$ and $T$ . A sweep over EMA decay $\alpha \in \{0.8, 0.9, 0.95, 0.99\}$ and refresh period $T \in \{16, 64, 256, \infty\}$ would provide essential guidance for practitioners and characterize the method’s robustness.
Batched inference experiment. Demonstrate KeepKV working correctly in a batched-serving setup (e.g., using vLLM’s continuous batching), which is the dominant production use case. The per-sequence metadata overhead should be measured.
GQA/MLA extension. Explicitly derive and implement KeepKV for GQA (used in LLaMA-3), where multiple query heads share one KV head. The current vote count logic treats each KV head independently; with GQA, the merged KV might be shared across 4–8 query groups, changing the effective attention score used in ZIP-Merging.

Reproducibility Notes

Code: Not yet publicly available at time of writing (v2, November 2025). The paper describes the algorithm in sufficient detail to re-implement.
Models: LLaMA-3-8B/70B-Instruct (Meta), Mistral-7B-Instruct (Mistral AI). All publicly available.
Benchmarks: LongBench (available on Hugging Face), RULER (github.com/hsiehjackson/RULER).
Hyperparameters to tune: Budget ratio $B/L$ , EMA decay $\alpha$ , EMA window $w$ , refresh period $T$ , number of recent tokens $R$ excluded from merging.
Key gotcha: The ZIP-Merging formula involves $\ln s_e^t$ which requires $s_e^t > 0$ always (true for exponentials) and $w_e \ln s_e^t + w_c \ln s_c^t \neq 0$ (fails only if both scores are exactly 1, i.e., $q_t k_i = 0$ for both candidates — an edge case requiring safeguard in implementation).

Summary and Takeaways

KeepKV makes a clean, theoretically grounded contribution to a well-trodden problem. The identification of “Attention Sag” as the fundamental failure mode of all prior merging methods, backed by a formal theorem, is a genuine insight. The Electoral Votes + ZIP-Merging combination is elegant: by changing what “attention score” means (multiplying by vote count) and deriving the unique merging formula that preserves this modified attention, KeepKV turns a previously lossy operation into a provably lossless one.

The multi-step extension is less clean — it requires a prediction assumption (EMA locality) and only achieves bounded rather than zero error — but the periodic refresh mechanism bridges the gap by resetting accumulated error.

For practitioners, the key takeaways are:

If you are already using token eviction (H2O, SnapKV), switching to KeepKV-style merging should give meaningful quality improvements at no throughput cost, and significant throughput gains at aggressive budgets.
The periodic refresh adds system complexity (external KV storage). For edge/embedded inference without ample CPU RAM, a non-refresh variant (bounded multi-step KeepKV) may be the practical choice.
The method is orthogonal to quantization — combining KeepKV with 4-bit KV quantization is a natural next step that the paper leaves unexplored.

Next steps from this work: Extending the theory to GQA and Multi-head Latent Attention (MLA as in DeepSeek-V2), integrating with continuous batching schedulers (vLLM, SGLang), and deriving tighter multi-step bounds that adapt the refresh period online based on observed EMA error.

Deep Dive: Connecting KeepKV to Information Theory

It is worth thinking about why KeepKV works from a higher-level perspective. The fundamental question in KV cache compression is: what information should we preserve and how?

Eviction answers: keep the KV pairs whose removal causes the least perturbation to the current output. This is greedy and myopic — it ignores that importance can shift over time.

Naive merging answers: blend the less important KV pairs together so their information is not lost. This is better in principle, but the specific blending formula (convex combination) systematically underweights the merged pair’s contribution (Attention Sag).

KeepKV answers: represent the same information differently. Instead of trying to approximate two distinct vectors with one, it tracks the count of vectors that have been fused (Electoral Votes) and then derives the unique key vector that, together with this count, would produce the same attention output as the original two distinct vectors. The insight is that the attention output depends not on the key vectors alone, but on their interaction with the query — and by tracking this interaction across merges, KeepKV can preserve the output exactly.

This is analogous to lossless compression in information theory: you cannot reduce the number of symbols while preserving every possible output, but you can change the representation so that the decoder (here, the attention computation) produces the same result with fewer stored items, provided the decoder is also updated to use the new representation (here, the vote-weighted attention formula).

Why Prior Merging Methods Cannot Be Fixed by Better Weights

A natural question: could CaM, D2O, or KVMerger simply use better weights to avoid Attention Sag? The answer is no, and Theorem 2 is the proof. The Attention Sag is not about which weights you choose within the convex combination — it is a consequence of the fact that any convex combination of two vectors produces a single vector whose exponential inner product with any query is strictly less than the sum of the individual exponentials (by strict convexity of the exponential). You cannot escape Jensen’s inequality by tweaking $w_e$ and $w_c$ .

The only escape is to change what the downstream computation does — which is exactly what the Electoral Votes mechanism achieves.

The Space Overhead of Electoral Votes

The Electoral Votes mechanism adds one integer $p_i$ per KV entry. For a cache with $B$ active entries, this is $B$ integers. Compare to the KV tensors themselves: each entry is a $d$ -dimensional float16 vector for both key and value, so $2d$ floats per entry. For $d = 128$ (LLaMA-3 head dimension), that is 256 float16 values = 512 bytes per entry, vs. 4 bytes for an int32 vote count. The overhead is $4/512 = 0.78\%$ — entirely negligible.

The ZIP-Merging computation itself involves a few additional scalar operations per merge (log computations for the key formula), but these are trivially fast compared to the GPU memory bandwidth bottleneck.

Practical Implementation Notes

Integration with Existing Frameworks

KeepKV can be integrated into any LLM inference framework that exposes the KV cache. The key changes are:

KV cache data structure: add an integer votes field alongside each (key, value) pair.
Attention computation: modify the attention kernel to multiply unnormalized scores by vote counts before softmax. Equivalently, modify the logits by adding $\ln p_i$ to each attention logit.
Compression step: implement the ZIP-Merging algorithm to select candidates (by cosine similarity), compute the merged key/value/votes, and update the cache.
EMA state: maintain per-layer, per-head EMA states ( $S_i$ for each active cache entry).

The most performance-sensitive part is the cosine similarity computation for candidate selection, which is $O(B^2)$ in the naive implementation. In practice, this can be approximated with locality-sensitive hashing (LSH) or FAISS-based nearest neighbor search for large budgets.

Numerical Stability Considerations

The ZIP-Merging key formula involves several numerical pitfalls:

Log of scores: $\ln s_e^t = q_t k_e / \sqrt{d}$ is the raw logit before exponentiation. This is readily available in the computation graph and avoids computing $\exp$ then $\log$ .
Division by small denominators: $w_e \ln s_e^t + w_c \ln s_c^t$ can be near zero when both logits are near zero. A simple fix: if $|w_e \ln s_e^t + w_c \ln s_c^t| < \epsilon$ , fall back to equal-weight merging $k_r = (k_e + k_c)/2$ (which is the standard convex combination case, but acknowledged as approximate).
Large vote counts: As merging cascades, vote counts can grow. For a sequence of 32K tokens compressed to 10% budget (3200 entries), the maximum possible vote count is 10 (32000/3200 = average merges). This is small and easily fits in an int8 field in practice.

Figure 7: Implementation Architecture Overview

graph TD
    subgraph "Inference Engine (e.g. vLLM)"
        A["Token Embedding"] --> B["Attention Layer"]
        B --> C["KV Cache Manager"]
        B --> D["Output Projection"]
    end

    subgraph "KeepKV Extension"
        C --> E["Vote Counter p_i"]
        C --> F["EMA Score Tracker S_i"]
        E --> G["ZIP-Merging Engine"]
        F --> G
        G --> H["Cosine Sim Index (LSH/FAISS)"]
        G --> C
        I["External KV Store (CPU RAM)"] <--> C
    end

    style G fill:#d4edda,stroke:#28a745
    style I fill:#cce5ff,stroke:#004085

Comparison with SnapKV and H2O in Depth

SnapKV and H2O are the two most widely deployed eviction methods. It is instructive to understand exactly where KeepKV surpasses them.

H2O (Heavy Hitter Oracle): Evicts tokens with the lowest cumulative attention score across past decoding steps. The key weakness: it uses past attention scores as a proxy for future importance, but as Figure 2(b) of the paper shows, many evicted tokens later become “heavy hitters.” KeepKV never evicts; it only compresses, so past importance is never forgotten.

SnapKV: Uses attention within an observation window at the end of the prompt to decide which tokens to evict. This is better than H2O for prompt-heavy long-context tasks because it makes eviction decisions based on the full prompt context. But it still permanently discards tokens, and the window-based heuristic can still miss important tokens outside the window.

Why KeepKV beats both at 10% budget: At very aggressive compression (90% of tokens removed), eviction methods are choosing to keep only the 10% of tokens with the highest scores — and still losing 90% of the context. KeepKV is keeping compressed representations of all tokens, so no context is ever lost, just encoded more compactly. At 10% budget, KeepKV maintains LongBench scores ~8 points higher than SnapKV (36.7 vs ~29 for SnapKV — author’s reported numbers), because the merged representations preserve the information that would otherwise be discarded.

Relation to Prior Theory: Compressive Transformers and DMC

Two relevant prior works deserve comparison:

Compressive Transformer (Rae et al., 2020): Uses a learned compression function to compress past activations into a fixed-size memory. KeepKV’s periodic refresh is conceptually related — both maintain an external memory of the full context. But Compressive Transformer requires training the compression function, whereas KeepKV is training-free and uses a theoretically derived merging formula.

Dynamic Memory Compression (DMC, Nawrot et al., 2024): Trains the model to decide when and how to merge tokens. DMC achieves higher compression than KeepKV in some regimes but requires retraining and cannot be applied to pre-trained models. KeepKV is a post-training method — it requires no model modification, making it compatible with any off-the-shelf LLM.

The key practical advantage of KeepKV over trained methods: you can apply it to any model immediately, without any fine-tuning, with competitive or superior performance on standard benchmarks.

Conclusion

KeepKV is a landmark result in KV cache compression for two reasons. First, it provides the first rigorous characterization of the failure mode of all prior merging methods (Attention Sag via Theorem 2). Second, it derives the unique correct formula that fixes this failure mode (ZIP-Merging via Theorem 3). The resulting method is elegant, theoretically principled, and empirically strong.

The work is not without limitations — the multi-step extension requires assumptions, the system overhead of external KV storage is real, and the GQA extension is left for future work. But as a step forward in the theory of KV cache compression, KeepKV sets a new standard for what “principled merging” means.

For anyone working on LLM inference optimization, this paper is essential reading. The key conceptual contribution — that attention computation can be redefined rather than approximated after merging — opens a new design space worth exploring further.

Extended Analysis: EMA Decay Parameter $\alpha$

The EMA decay $\alpha$ is arguably the most important hyperparameter in KeepKV’s multi-step extension. Understanding its behavior helps build intuition for when the method works well and when it might struggle.

High $\alpha$ (e.g., 0.99): The EMA tracks the long-run average of attention scores, changing slowly. This is good when attention patterns are stable over long contexts (e.g., summarization tasks where the entire document is relevant throughout). It is bad when attention shifts rapidly (e.g., question-answering at the end of a long document where only the question tokens matter).

Low $\alpha$ (e.g., 0.8): The EMA responds quickly to recent changes, adapting to sudden shifts in which tokens are important. This is good for dynamic contexts but can be noisy — a single outlier step can push the EMA far from the true average.

The bias correction term $1/(1-\alpha^t)$ ensures the EMA is not artificially small early in decoding (before it has seen enough steps to converge). Without it, the EMA at step $t=1$ would be $(1-\alpha) s^1$ , which for $\alpha = 0.99$ would be $0.01 \cdot s^1$ — far too small. With correction, it equals $s^1$ at $t=1$ , then gradually shifts toward the exponentially weighted average.

Practical tuning strategy: The paper reports good results with $\alpha = 0.9$ and window size $w = 32$ . A reasonable heuristic: set $\alpha$ such that the effective memory length $1/(1-\alpha)$ is roughly $10-20\%$ of the context length. For a 32K context, $\alpha = 0.9$ gives memory length $\approx 10$ , which is too short. A value like $\alpha = 0.999$ (memory $\approx 1000$ ) might be more appropriate.

Figure 8: EMA Sensitivity — Conceptual Diagram

xychart-beta
    title "EMA Tracking Quality vs. Alpha (Conceptual)"
    x-axis ["Step 1", "Step 100", "Step 500", "Step 1000", "Step 2000"]
    y-axis "EMA Error" 0 --> 10
    line [5, 4, 3, 2, 1]
    line [5, 3, 2, 1, 0.5]
    line [5, 2, 1, 0.8, 0.7]

(Blue=low α, Orange=medium α, Green=high α. High α converges slower but tracks stable attention patterns better over long runs.)

Why the Locality Assumption Holds Empirically

The key theoretical assumption underlying KeepKV’s multi-step extension is that attention scores change smoothly across decoding steps. This “locality” has both intuitive and empirical support.

Intuitive argument: A token’s attention score at step $t$ reflects how much the model’s current hidden state $q_t$ “cares about” that past token. As generation proceeds, $q_t$ changes — but it changes continuously (the model processes one new token at a time). So the dot products $q_t \cdot k_i$ also change continuously, meaning attention scores evolve smoothly rather than jumping discontinuously.

Empirical support: Figure 2(c) of the paper shows that the variance of a token’s attention score across adjacent steps (blue) is much larger than the variance within a sliding window (orange). This means within-window prediction (which is what EMA uses) is significantly more accurate than across-window prediction — exactly the locality property KeepKV relies on.

When it breaks down: The locality assumption can fail for:

Very long contexts where attention patterns change drastically between the beginning and end of the document.
Multi-turn conversation tasks where the topic shifts sharply between turns.
Speculative decoding contexts where the draft model produces very different queries from the target model.

In these cases, a shorter EMA window or more frequent periodic refreshes would be needed.

Memory and Compute Analysis

Understanding the overhead profile of KeepKV is essential for deployment decisions.

Memory overhead:

Component	Size per layer/head	Notes
Vote counts $p_i$	$B \times 4$ bytes (int32)	~0.78% of KV size
EMA states $S_i$	$B \times d \times 2$ bytes (fp16)	Same size as K cache
External KV	$L \times d \times 2 \times 2$ bytes	Full K+V on CPU

For LLaMA-3-8B (32 layers, 32 heads, $d=128$ ) at $L=32768$ and $B=3277$ (10% budget):

Vote counts: $32 \times 32 \times 3277 \times 4 = 13$ MB (negligible)
EMA states: same as K cache = $32 \times 32 \times 3277 \times 128 \times 2 = 860$ MB
External KV: $32 \times 32 \times 32768 \times 128 \times 2 \times 2 = 17$ GB on CPU

So the total external memory requirement is ~17 GB for LLaMA-3-8B at 32K context. For LLaMA-3-70B (80 layers, 64 heads) at 128K context, this would be ~280 GB — requiring a well-equipped server.

Compute overhead:

The dominant overhead is the candidate selection (cosine similarity). For a budget $B$ , naive $O(B^2)$ selection at every compression step is costly. At $B=3277$ , this is ~10.7M similarity computations. Amortized over 10 steps between compressions, it is ~1.07M/step. This is manageable but non-trivial.

In practice, the paper reports less than 5% total overhead from the KeepKV bookkeeping, suggesting the cosine similarity computation is efficient enough at their tested budget sizes.

Open Questions and Future Directions

1. Adaptive refresh scheduling. The current design uses a fixed refresh period $T$ . An adaptive scheme that monitors the cumulative EMA prediction error and triggers a refresh when the error exceeds a threshold would be more principled — and could save unnecessary I/O when attention patterns are stable.

2. Hierarchical merging. Currently, KeepKV merges KV pairs one at a time. A hierarchical approach (merge groups of similar tokens simultaneously, like a tree-structured compression) could be faster and potentially more principled.

3. Cross-layer merging. KeepKV merges within a single layer. Different layers have very different attention patterns; coordinating merging decisions across layers (e.g., merge a token at layer 5 only if it is also low-importance at layers 1–4) could improve overall quality.

4. Learned merge scheduling. While the ZIP-Merging formula is theoretically optimal given the current query, the choice of when to merge (which step to trigger compression) and which candidates to merge (beyond cosine similarity) could potentially be improved with a lightweight learned policy.

5. Integration with prefix caching. Modern serving systems (vLLM, SGLang) cache the KV of system prompts across requests. KeepKV’s external KV storage is compatible with prefix caching — the full system prompt KV could be stored once and shared across requests, with per-request compressed KV for the dynamic portion.