May 29, 2026 EN #Model Compression #SVD & Low-Rank #LLM Inference

IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression

Review date: 2026-05-29
Review author: Zhongzhu Zhou
Paper reviewed: IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression
Paper authors: Ali Abbasi, Chayne Thrash, Haoran Qin, Hamed Pirsiavash, Soheil Kolouri (Vanderbilt University, UC Davis)
arXiv: 2605.15626
Status / Venue: arXiv preprint, May 2026

Short Answer

IO-SVD improves SVD-based post-training LLM compression by replacing one-sided activation whitening with a KL-aware double-sided whitening objective, pairing it with greedy heterogeneous rank allocation scored by first-order calibration loss, and adding loss-aware quantization remapping for hybrid SVD–int8 compression. Together these three ideas push WikiText2 perplexity on LLaMA-7B from 7.94 (SVD-LLM) down to 5.59 at 80% parameter retention. At 60% retention the gap widens to 6.27 versus 13.11, roughly a 2× reduction, while ASVD collapses entirely (PPL 1407).

Prerequisites

Before diving in, let me lay out the background knowledge you need to follow the mathematics and engineering decisions in this paper.

Singular Value Decomposition (SVD)

Any real matrix $W \in \mathbb{R}^{m \times n}$ can be written as

$W = U \Sigma V^\top$

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices, and $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal with non-negative entries $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0$ called singular values.

The Eckart–Young–Mirsky theorem guarantees that the best rank- $r$ approximation (in Frobenius or spectral norm) to $W$ is

$\hat{W}_r = U_{:,1:r}\, \Sigma_{1:r,1:r}\, V_{:,1:r}^\top$

i.e., just keep the top $r$ singular values and their singular vectors. The approximation error is $\sum_{i>r} \sigma_i^2$ in squared Frobenius norm.

In LLM compression, we apply this idea to every linear weight matrix: replace each $W \in \mathbb{R}^{m \times n}$ with a product $A B^\top$ where $A \in \mathbb{R}^{m \times r}$ , $B \in \mathbb{R}^{n \times r}$ , $r \ll \min(m,n)$ . If the compression ratio is $\rho = r(m+n)/(mn)$ , then at $\rho = 0.8$ we’re keeping 80% of the parameters.
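The truncation rule and the error formula above can be checked numerically. The following is a minimal NumPy sketch on a random matrix; the sizes `m`, `n`, `r` are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 48, 8                      # illustrative sizes only
W = rng.standard_normal((m, n))

# Thin SVD: U is m x k, s has k entries, Vt is k x n, with k = min(m, n).
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Keep the top-r singular triplets and fold the singular values into A.
A = U[:, :r] * s[:r]                     # m x r
B = Vt[:r, :].T                          # n x r
W_hat = A @ B.T                          # rank-r approximation A B^T

# Eckart-Young-Mirsky: squared Frobenius error equals the discarded energy.
err = np.linalg.norm(W - W_hat, "fro") ** 2
tail = np.sum(s[r:] ** 2)

# Fraction of parameters kept when storing A and B instead of W.
rho = r * (m + n) / (m * n)
print(f"error={err:.4f}  tail energy={tail:.4f}  rho={rho:.3f}")
```

The two printed error values agree up to floating-point rounding, which is exactly the statement that the squared Frobenius error equals the sum of the discarded squared singular values.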

Why Vanilla SVD Falls Short for LLMs

The plain truncated SVD minimizes weight reconstruction error $\|W - \hat W\|_F^2$ . But in a neural network, weight perturbations matter only through their effect on the model’s output. Two perturbations with the same Frobenius norm but different alignment with the network’s active subspaces can have very different effects on loss.

Intuition: If the network rarely activates certain directions in weight space (because the input distribution never excites them), large singular values in those directions are essentially noise for inference. Vanilla SVD keeps them anyway because they have large $\sigma_i$ , while discarding small singular values that might be in highly “active” directions.

What is Whitening?

Whitening is a linear transformation that makes a distribution isotropic (unit-covariance, zero-mean). If input activations $x$ have covariance $R = \mathbb{E}[xx^\top]$ , whitening maps $x \mapsto R^{-1/2} x$ so the transformed vector has identity covariance. After whitening, Euclidean distance is a better proxy for how much the network “cares” about a difference in that direction.

In SVD compression, the idea is: instead of minimizing $\|W - \hat W\|_F^2$ in the raw weight space, minimize the activation reconstruction error:

$\|(W - \hat W) X\|_F^2 = \|W R^{1/2} - \hat W R^{1/2}\|_F^2 \quad (\text{for } R = XX^\top)$

This is equivalent to doing SVD on the whitened matrix $W R^{1/2}$ , then unwhitening the result.
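To make the whiten, truncate, unwhiten recipe concrete, here is a small NumPy sketch on synthetic data. It uses the symmetric square root of $R$ computed by eigendecomposition; the activation scaling and the helper names `truncate` and `act_err` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, T, r = 32, 24, 500, 6
W = rng.standard_normal((m, n))
# Anisotropic activations: some input directions are far more active than others.
X = rng.standard_normal((n, T)) * np.logspace(0, -2, n)[:, None]

R = X @ X.T                                      # n x n, symmetric positive definite here
evals, evecs = np.linalg.eigh(R)
evals = np.clip(evals, 1e-12, None)              # guard against tiny negative round-off
R_half = (evecs * np.sqrt(evals)) @ evecs.T      # symmetric square root R^{1/2}
R_half_inv = (evecs / np.sqrt(evals)) @ evecs.T  # R^{-1/2}

def truncate(M, rank):
    """Best rank-`rank` approximation of M in Frobenius norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def act_err(W_hat):
    """Activation reconstruction error ||(W - W_hat) X||_F^2."""
    return np.linalg.norm((W - W_hat) @ X, "fro") ** 2

W_plain = truncate(W, r)                         # vanilla SVD in weight space
W_white = truncate(W @ R_half, r) @ R_half_inv   # whiten, truncate, unwhiten

# The identity from the text, evaluated on the vanilla approximation.
lhs = act_err(W_plain)
rhs = np.linalg.norm((W - W_plain) @ R_half, "fro") ** 2
print(lhs, rhs, act_err(W_white))
```

On this toy problem the whitened truncation gives a lower activation error than vanilla SVD at the same rank, as the optimality argument predicts.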

KL Divergence and Second-Order Taylor Expansion

The KL divergence $\text{KL}(p \| q)$ measures how much a distribution $q$ differs from $p$ :

$\text{KL}(p \| q) = \sum_i p_i \log \frac{p_i}{q_i}$

For LLMs, the natural objective when compressing is to keep the model’s predictive distribution $q_\theta(\cdot | x)$ as close as possible to the uncompressed model $q_{\theta_0}(\cdot | x)$ .

A second-order Taylor expansion of $\text{KL}(p \| \text{softmax}(z + \delta z))$ around $\delta z = 0$ gives:

$\text{KL}(p \| \text{softmax}(z + \delta z)) \approx \frac{1}{2} \delta z^\top H \delta z$

where $H = \text{Diag}(p) - pp^\top$ is the Hessian of the softmax cross-entropy (the softmax Jacobian). This quadratic form tells us which directions in logit space are most sensitive to perturbation.
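A quick numerical check of the quadratic approximation, as a sketch with random logits and a small random perturbation (the vocabulary size of 10 and the perturbation scale are arbitrary choices for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
z = rng.standard_normal(10)              # toy logits
p = softmax(z)
H = np.diag(p) - np.outer(p, p)          # Diag(p) - p p^T

dz = 1e-2 * rng.standard_normal(10)      # small logit perturbation
exact = kl(p, softmax(z + dz))
quad = 0.5 * dz @ H @ dz                 # second-order prediction

# Softmax ignores constant shifts, so the all-ones vector lies in the null space of H.
shift_invariant = np.allclose(H @ np.ones(10), 0.0)
print(exact, quad, shift_invariant)
```

For perturbations of this size the exact KL and the quadratic form agree to within a few percent; the gap grows as the perturbation gets larger, which is the usual caveat for any Taylor-based metric.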

Fisher Information and Kronecker Factoring

For a parametric model with parameters $\theta$ , the Fisher information matrix measures the curvature of the log-likelihood. For matrix-shaped parameters (like transformer weight matrices), it can often be approximated by a Kronecker product of two smaller matrices — one capturing input statistics, one capturing output (gradient) statistics. This is the key insight behind methods like K-FAC (Kronecker-Factored Approximate Curvature).

IO-SVD derives its whitening matrices from a Kronecker-factored approximation to the Hessian of the KL loss, making it information-geometry-aware.

Transformer Linear Layers and Low-Rank Compression

In a transformer, most parameters live in linear layers: query/key/value/output projections in attention, and up/gate/down projections in MLP blocks. For a model with $L$ layers and hidden dimension $d$ :

Attention projections: $W_Q, W_K, W_V, W_O \in \mathbb{R}^{d \times d}$
MLP up/down projections: $W_{up} \in \mathbb{R}^{d \times 4d}$ , $W_{down} \in \mathbb{R}^{4d \times d}$

Low-rank compression replaces each $W$ with $AB^\top$ , reducing per-layer parameter count from $mn$ to $r(m+n)$ . The storage break-even rank is $r^* = \lfloor mn/(m+n) \rfloor$ — above this rank, storing two factors is actually more expensive than the dense matrix.
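The break-even arithmetic is easy to get wrong by one, so here is a small sketch using only the standard library. The hidden size `d = 4096` and the `d × 4d` MLP shape follow the simplified layout in the text; the helper names are hypothetical.

```python
import math

def breakeven_rank(m: int, n: int) -> int:
    """Largest rank r with r * (m + n) <= m * n."""
    return (m * n) // (m + n)

def factored_params(m: int, n: int, r: int) -> int:
    """Parameters needed to store the two factors of a rank-r product."""
    return r * (m + n)

def rank_for_retention(m: int, n: int, rho: float) -> int:
    """Largest rank whose factors use at most rho * m * n parameters."""
    return math.floor(rho * m * n / (m + n))

d = 4096
attn_star = breakeven_rank(d, d)                # square attention projection
mlp_star = breakeven_rank(d, 4 * d)             # d x 4d MLP projection
attn_rank_80 = rank_for_retention(d, d, 0.8)    # rank at 80% retention

print(attn_star, mlp_star, attn_rank_80)
```

For a square `d × d` matrix the break-even rank is `d / 2`, so keeping 80% of the parameters already means dropping below 40% of the full rank.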

The IO-SVD Method

IO-SVD has three interlocking components. Let me walk through each in detail.

Component 1: KL-Aware Double-Sided Whitening

The Problem with One-Sided Whitening

Existing methods like SVD-LLM whiten the weight matrix by right-multiplying it with a square-root factor of the input activation covariance (SVD-LLM uses the Cholesky factor; the symmetric square root is written here):

$\tilde{W} = W R^{1/2}, \quad R = \mathbb{E}[xx^\top]$

They then do SVD on $\tilde{W}$ , drop small singular values, and undo the whitening. This captures input activation geometry but ignores how errors in the layer’s output affect the downstream loss.

The Alternative that Fails: You could try to use output gradients directly, but the output Jacobian $J_t = \partial z_t / \partial h_t$ for a token $t$ (mapping from layer output $h_t$ to logits $z_t$ ) has dimension $V \times d_{out}$ where $V \sim 32000$ is the vocabulary size — too large to form explicitly.

IO-SVD’s Solution: Derive the Output Metric from KL

For a target layer $\ell$ with weight $W_\ell$ and compression $\hat{W}_\ell$ , the perturbation is $\Delta W_\ell = W_\ell - \hat{W}_\ell$ . This changes layer output by $\delta h_t = \Delta W_\ell x_t$ , which propagates to logit perturbation $\delta z_t \approx J_t \Delta W_\ell x_t$ .

Step 1: Define the global objective (Eq. 1 in the paper):

$J(\hat\theta) = \mathbb{E}_{(x,y)} \left[ \sum_{t=1}^T \text{KL}\!\left(p_\theta(\cdot \mid x, y_{<t}) \;\|\; p_{\hat\theta}(\cdot \mid x, y_{<t})\right) \right]$

We want $\hat\theta$ that minimizes this token-level KL divergence between the original model and compressed model.

Step 2: Taylor-expand the per-token KL around the uncompressed logits $z_t$ (Appendix B.1):

Since $\text{KL}(p \| p) = 0$ and $\nabla_z \text{KL}(p \| \text{softmax}(z))|_{z=z_t} = 0$ , the first-order term vanishes. The second-order expansion gives:

$\text{KL}(p_t \| \text{softmax}(z_t + \delta z_t)) = \frac{1}{2} \delta z_t^\top H_t \delta z_t + O(\|\delta z_t\|^3)$

where

$H_t = \text{Diag}(p_t) - p_t p_t^\top \quad \text{(Eq. 2)}$

This matrix $H_t$ is symmetric positive semi-definite (it is the Hessian of the KL loss w.r.t. logits). Note $H_t \mathbf{1} = 0$ , reflecting softmax invariance to constant logit shifts.

Step 3: Propagate layer perturbation to logit change using the linearization $\delta z_t \approx J_t \Delta W_\ell x_t$ , where $J_t = \partial z_t / \partial h_{\ell,t}$ is the Jacobian of the final logits w.r.t. layer $\ell$'s output:

$\Delta J_{\ell,t} \approx \frac{1}{2} x_t^\top \Delta W_\ell^\top C_{\text{token},t} \Delta W_\ell x_t, \quad C_{\text{token},t} = J_t^\top H_t J_t \quad \text{(Eq. 3)}$

Step 4: Average over calibration tokens (Eq. 4). Using the trace identity $a^\top M a = \text{tr}(M aa^\top)$ and averaging:

$\Delta J_\ell \approx \frac{1}{2} \left\| C_\ell^{1/2} (W_\ell - \hat W_\ell) R_\ell^{1/2} \right\|_F^2$

where:

$R_\ell = \mathbb{E}_t[x_t x_t^\top]$ — input activation covariance (captures input geometry)
$C_\ell = \mathbb{E}_t[C_{\text{token},t}]$ — output sensitivity matrix (captures how the predictive distribution changes)

The moment-decoupling approximation used to go from the per-token product to this form assumes that inputs $x_t$ and the output curvature $C_{\text{token},t}$ are approximately independent — a standard Kronecker factorization assumption.
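The trace step can be verified numerically. The sketch below assumes a single output metric `C` shared by all tokens, in which case the doubly-weighted Frobenius form is exact; with a token-dependent $C_{\text{token},t}$ the same expression is only the decoupled approximation described above. The factor $\frac{1}{2}$ is dropped on both sides, and all matrices are random stand-ins.

```python
import numpy as np

def sym_sqrt(M):
    """Symmetric square root of a symmetric PSD matrix."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(0)
m, n, T = 6, 5, 200
dW = rng.standard_normal((m, n))         # stand-in for W - W_hat
X = rng.standard_normal((T, n))          # one input row per token
E = rng.standard_normal((m, m))
C = E @ E.T                              # shared PSD output metric (assumption)

# Exact token average of x_t^T dW^T C dW x_t.
dH = X @ dW.T                            # T x m, the per-token output perturbations
exact = np.einsum("ti,ij,tj->t", dH, C, dH).mean()

# Decoupled form ||C^{1/2} dW R^{1/2}||_F^2 with R = E_t[x_t x_t^T].
R = X.T @ X / T
decoupled = np.linalg.norm(sym_sqrt(C) @ dW @ sym_sqrt(R), "fro") ** 2
print(exact, decoupled)
```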

Why this formulation is better: The objective $\|C_\ell^{1/2}(W-\hat W)R_\ell^{1/2}\|_F^2$ is equivalent to a Frobenius-norm problem on the doubly-whitened matrix $B_\ell = C_\ell^{1/2} W_\ell R_\ell^{1/2}$ . By Eckart–Young–Mirsky, the optimal low-rank approximation of $W_\ell$ under this objective is found by:

Form $B_\ell = C_\ell^{1/2} W_\ell R_\ell^{1/2}$ — doubly whiten (Eq. 5)
Compute rank- $r$ truncated SVD: $\hat B_\ell = U_r \Sigma_r V_r^\top$ (Eq. 7)
Undo whitening: $\hat W_\ell^* = C_\ell^{-1/2} U_r \Sigma_r V_r^\top R_\ell^{-1/2}$ (Eq. 8)

Algorithm: IO-SVD Layer Compression
Input:  W_ℓ ∈ ℝ^{m×n}, R_ℓ (input cov.), C_ℓ (output sensitivity), rank r
Output: Ŵ_ℓ ≈ W_ℓ (low-rank)

Step 1: Compute R_ℓ^{1/2}, C_ℓ^{1/2} via eigendecomposition
         (use damped estimates R̄_ℓ = R_ℓ + λ_R I and C̄_ℓ = C_ℓ + λ_C I to ensure invertibility)

Step 2: Form doubly-whitened matrix
         B_ℓ = C_ℓ^{1/2} W_ℓ R_ℓ^{1/2}

Step 3: Compute truncated SVD of B_ℓ
         U_r, Σ_r, V_r = top-r SVD(B_ℓ)

Step 4: Unwhiten to recover compressed weight
         Ŵ_ℓ = C_ℓ^{-1/2} U_r Σ_r V_r^T R_ℓ^{-1/2}

Design choice: why damping? Without damping, $R_\ell$ or $C_\ell$ may be nearly singular (some directions in activation space are never excited by the calibration data). Adding $\lambda I$ ensures we can take square roots and inverses stably. The damping constants $\lambda_R, \lambda_C$ are hyperparameters chosen small enough not to distort the geometry but large enough to prevent numerical issues.
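The four steps, including damping, can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: `sym_power` and `io_svd_compress` are hypothetical helper names, the damping defaults are arbitrary, and `C` is built from random samples instead of the KL curvature.

```python
import numpy as np

def sym_power(M, power, damping):
    """Matrix power of a symmetric PSD matrix after adding damping * I."""
    w, V = np.linalg.eigh(M + damping * np.eye(M.shape[0]))
    return (V * w ** power) @ V.T

def io_svd_compress(W, R, C, rank, lam_R=1e-6, lam_C=1e-6):
    """Return factors (A, D) with W_hat = A @ D.T, optimal in the doubly-whitened norm."""
    B = sym_power(C, 0.5, lam_C) @ W @ sym_power(R, 0.5, lam_R)   # Step 2
    U, s, Vt = np.linalg.svd(B, full_matrices=False)              # Step 3
    A = sym_power(C, -0.5, lam_C) @ (U[:, :rank] * s[:rank])      # Step 4, left factor
    D = sym_power(R, -0.5, lam_R) @ Vt[:rank].T                   # Step 4, right factor
    return A, D

rng = np.random.default_rng(0)
m, n, r, T = 20, 16, 4, 300
W = rng.standard_normal((m, n))
X = rng.standard_normal((n, T)) * np.linspace(2.0, 0.1, n)[:, None]
S = rng.standard_normal((m, T)) * np.linspace(3.0, 0.1, m)[:, None]
R = X @ X.T / T                          # input second moment
C = S @ S.T / T                          # random PSD stand-in for output sensitivity

A, D = io_svd_compress(W, R, C, r)
W_io = A @ D.T

def weighted_err(W_hat):
    C_half, R_half = sym_power(C, 0.5, 1e-6), sym_power(R, 0.5, 1e-6)
    return np.linalg.norm(C_half @ (W - W_hat) @ R_half, "fro") ** 2

U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_plain = (U[:, :r] * s[:r]) @ Vt[:r]    # vanilla rank-r truncation for comparison
print(weighted_err(W_io), weighted_err(W_plain))
```

Returning the two factors rather than the dense product keeps the storage benefit; the unwhitening matrices are simply absorbed into `A` and `D`.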

What would happen without output whitening? Methods like SVD-LLM use only $R_\ell$ (one-sided). Two whitened singular components with the same $\sigma_i$ in $W R^{1/2}$ might have very different downstream effects depending on whether they point in directions with high output sensitivity (large eigenvalues of $C_\ell$ ). Without $C_\ell$ , we treat all output directions equally — a miss.

Efficiently computing $C_\ell$

The naive computation of $C_{\text{token},t} = J_t^\top H_t J_t$ requires materializing $J_t \in \mathbb{R}^{V \times d_{out}}$ and $H_t \in \mathbb{R}^{V \times V}$ — both are huge (vocabulary $V \sim 32000$ ).

IO-SVD sidesteps this with top-K approximation: restrict to the top- $K$ tokens in the uncompressed model’s output distribution, renormalize $p_{t,K}$ on this support, and accumulate $C_\ell$ using vector-Jacobian products (VJPs) via backward hooks. Let:

$s_t = \sqrt{p_{t,K}}, \quad D_t = \text{Diag}(s_t), \quad \Omega_t = I - s_t s_t^\top$

Then $H_{t,K} = A_t A_t^\top$ where $A_t = D_t \Omega_t$ , and $\Omega_t$ (which satisfies $\Omega_t^2 = \Omega_t$ ) is the orthogonal projector onto the subspace perpendicular to $s_t$ . This factorization lets us accumulate $C_\ell$ with $K$ backward passes per token, each seeded with one column of $A_t$ as the cotangent vector and each yielding a $d_{out}$ -dimensional row. Only $K$ -dimensional logit slices are touched, which is much cheaper than forming explicit $V$ -dimensional Jacobians.

The ablation over $K$ (Figure 3 in the paper) shows a “sweet spot” where more top tokens stop helping and may hurt. The optimal $K$ on WikiText2 generalizes to PTB and C4, suggesting robustness.

Component 2: Adaptive Heterogeneous Rank Allocation

The Problem with Fixed Rank Ratios

Existing methods often allocate rank proportionally to layer size (e.g., compress each layer to 80% of its parameters). But different layers have dramatically different sensitivities to compression: early layers, attention heads, or certain MLP blocks may be far more critical to downstream quality than others. Forcing a uniform compression ratio wastes capacity in easy-to-compress layers and damages critical ones.

Alternative approach: gradient-based optimization (as in Dobi-SVD) can find optimal per-layer ranks but requires running backpropagation through the compressed model at each search step — computationally expensive.

IO-SVD’s Solution: Greedy Score-Based Allocation

Given the whitened representation $B_\ell = U_\ell \Sigma_\ell V_\ell^\top$ , how much does dropping the $i$ -th singular component hurt the loss?

First-order score derivation (Eq. 10 in the paper): The calibration loss $\mathcal{L}$ depends on $W_\ell$ . The corresponding whitened gradient is:

$\tilde{G}_\ell = C_\ell^{-1/2} G_\ell R_\ell^{-1/2}, \quad G_\ell = \frac{\partial \mathcal{L}}{\partial W_\ell}$

The derivative of $\mathcal{L}$ w.r.t. the $i$ -th singular value of $B_\ell$ is:

$g_{\ell,i} = u_{\ell,i}^\top \tilde{G}_\ell v_{\ell,i}$

If we drop this component, the change in $\sigma_{\ell,i}$ is $-\sigma_{\ell,i}$ , so the predicted first-order loss change is:

$\Delta \mathcal{L}_{\ell,i} \approx g_{\ell,i} \cdot (-\sigma_{\ell,i})$

The importance score is therefore:

$I_{\ell,i} = |g_{\ell,i} \sigma_{\ell,i}| \quad \text{(Eq. 10)}$

This is a product of gradient magnitude and singular value magnitude — a component is considered unimportant if either the singular value is small (it contributes little to the weight itself) or the gradient is small (the loss is insensitive to changes in that direction).
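The claim that $g_{\ell,i}$ is the derivative of the loss with respect to the $i$ -th whitened singular value can be checked with finite differences. The sketch below uses a toy quadratic loss $\frac{1}{2}\|WX - Y\|_F^2$ in place of the calibration loss, and random positive-definite stand-ins for $R_\ell$ and $C_\ell$ ; all of these are assumptions for illustration.

```python
import numpy as np

def sym_power(M, power):
    w, V = np.linalg.eigh(M)
    return (V * w ** power) @ V.T

rng = np.random.default_rng(0)
m, n, T = 8, 6, 50
W = rng.standard_normal((m, n))
X = rng.standard_normal((n, T))
Y = rng.standard_normal((m, T))

def loss(Wc):
    """Toy stand-in for the calibration loss."""
    return 0.5 * np.linalg.norm(Wc @ X - Y, "fro") ** 2

G = (W @ X - Y) @ X.T                          # dL/dW for the toy loss

R = X @ X.T / T + 1e-3 * np.eye(n)             # damped input second moment
E = rng.standard_normal((m, m))
C = E @ E.T / m + 1e-3 * np.eye(m)             # random SPD output metric
C_half, C_ih = sym_power(C, 0.5), sym_power(C, -0.5)
R_half, R_ih = sym_power(R, 0.5), sym_power(R, -0.5)

B = C_half @ W @ R_half                        # doubly-whitened weight
U, s, Vt = np.linalg.svd(B, full_matrices=False)
G_tilde = C_ih @ G @ R_ih                      # whitened gradient

g = np.einsum("mi,mn,in->i", U, G_tilde, Vt)   # g_i = u_i^T G_tilde v_i
scores = np.abs(g * s)                         # importance I_i = |g_i * sigma_i|

# Central finite difference of the loss with respect to sigma_i.
i, eps = 2, 1e-3
bump = eps * np.outer(U[:, i], Vt[i])
fd = (loss(C_ih @ (B + bump) @ R_ih) - loss(C_ih @ (B - bump) @ R_ih)) / (2 * eps)
print(g[i], fd, scores)
```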

The greedy allocation algorithm (Algorithm 2):

Input:  {W_ℓ} with whitened SVDs, global budget B_target, min-rank ratio η
Output: {r_ℓ} (per-layer ranks), {Ŵ_ℓ} (compressed weights)

1. For each layer ℓ:
     a. Compute B_ℓ = C_ℓ^{1/2} W_ℓ R_ℓ^{1/2}; run SVD
     b. Compute whitened gradient G̃_ℓ; score each component I_{ℓ,i} = |g_{ℓ,i} σ_{ℓ,i}|
     c. Set r_ℓ = r_ℓ^max = min(m_ℓ, n_ℓ)
     d. Compute breakeven rank: r_ℓ* = ⌊m_ℓ n_ℓ / (m_ℓ + n_ℓ)⌋
     e. Set r_ℓ^min = ⌈η · r_ℓ*⌉  (minimum rank floor, e.g. η=0.1)
     f. Push tail component (I_{ℓ,r_ℓ}, ℓ, r_ℓ, storage_gain) into shared min-heap Q

2. While budget b < B_target and Q not empty:
     a. Pop (I_{ℓ,i}, ℓ, i, Δb) with SMALLEST score from Q  [min-heap]
     b. Drop this singular component; update b += Δb, r_ℓ -= 1
     c. If r_ℓ > r_ℓ^min: push next tail candidate of layer ℓ back into Q

3. Reconstruct: for each layer ℓ:
     If r_ℓ > r_ℓ*: keep dense W_ℓ (no compression benefit)
     Else: Ŵ_ℓ = C_ℓ^{-1/2} U_{ℓ,1:r_ℓ} Σ_{ℓ,1:r_ℓ} V_{ℓ,1:r_ℓ}^T R_ℓ^{-1/2}

Why greedy works here: The first-order score $I_{\ell,i}$ gives an independent estimate of each component’s loss contribution. While greedy is not globally optimal (it ignores interaction effects), in practice these interactions are small when compressing one component at a time, and the first-order approximation is good in the low-compression regime where the compressed model is still close to the original.

The storage gain formula (Eq. 14) is crucial for efficiency. For a weight $W_\ell \in \mathbb{R}^{m_\ell \times n_\ell}$ , the dense representation has $m_\ell n_\ell$ parameters. A rank- $r$ factorization uses $r(m_\ell + n_\ell)$ . The breakeven is $r_\ell^* = \lfloor m_\ell n_\ell / (m_\ell + n_\ell) \rfloor$ .

The storage gain from dropping rank from $r$ to $r-1$ is:

$\text{storage\_gain}(r) = \begin{cases} 0 & r > r_\ell^* + 1 \\ m_\ell n_\ell - r_\ell^*(m_\ell + n_\ell) & r = r_\ell^* + 1 \\ m_\ell + n_\ell & r \leq r_\ell^* \end{cases}$

This piecewise formula has important implications: the algorithm won’t waste budget removing singular components from a layer that’s still above breakeven rank (since those removals save zero storage). It only starts realizing savings once a layer crosses below its breakeven rank.
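The piecewise storage gain and the heap-driven loop fit in a few lines of standard-library Python. This is a sketch of the procedure as described in this review, not the paper's code: the function names are hypothetical, the scores are made-up inputs, and the budget is expressed as a number of parameters to save.

```python
import heapq
import math

def breakeven_rank(m, n):
    return (m * n) // (m + n)

def storage_gain(r, m, n):
    """Parameters saved by lowering a layer's rank from r to r - 1."""
    r_star = breakeven_rank(m, n)
    if r > r_star + 1:
        return 0                              # layer would still be stored dense
    if r == r_star + 1:
        return m * n - r_star * (m + n)       # switch from dense to factored storage
    return m + n                              # one fewer column in each factor

def greedy_allocate(shapes, scores, target_saving, eta=0.1):
    """shapes[l] = (m, n); scores[l][i] = importance of component i (0-based,
    ordered by decreasing singular value). Returns (ranks, parameters saved)."""
    ranks = [min(m, n) for m, n in shapes]
    floors = [math.ceil(eta * breakeven_rank(m, n)) for m, n in shapes]
    heap = []
    for l in range(len(shapes)):
        if ranks[l] > floors[l]:
            heapq.heappush(heap, (scores[l][ranks[l] - 1], l))   # tail component
    saved = 0
    while saved < target_saving and heap:
        _, l = heapq.heappop(heap)            # least important tail component
        m, n = shapes[l]
        saved += storage_gain(ranks[l], m, n)
        ranks[l] -= 1
        if ranks[l] > floors[l]:
            heapq.heappush(heap, (scores[l][ranks[l] - 1], l))
    return ranks, saved

# Two 8x8 layers: layer 0 has uniformly higher scores than layer 1.
shapes = [(8, 8), (8, 8)]
scores = [[10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0],
          [2.0, 1.0, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]]
ranks, saved = greedy_allocate(shapes, scores, target_saving=32)
print(ranks, saved)   # → [8, 2] 32
```

In this toy run all removals come from the low-score layer: it is pushed from rank 8 down to rank 2 (storing 2 × 16 = 32 parameters instead of 64), while the high-score layer stays dense.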

Visualization of the allocation process:

graph TD
    A["Initialize: all layers at max rank<br>Score every tail component I_{ℓ,i} = |g_{ℓ,i}σ_{ℓ,i}|"]
    A --> B["Shared min-heap Q with one tail candidate per layer"]
    B --> C{"Budget b < B_target?"}
    C -->|Yes| D["Pop layer ℓ* with min score (least important component)"]
    D --> E["Drop tail component of ℓ*; r_{ℓ*} -= 1; b += storage_gain"]
    E --> F{"r_{ℓ*} > r_min?"}
    F -->|Yes| G["Push ℓ*'s new tail component back into Q"]
    F -->|No| H["Layer ℓ* hits min-rank floor; no more from ℓ*"]
    G --> C
    H --> C
    C -->|No| I["Stop: heterogeneous {r_ℓ} achieved"]
    I --> J["Unwhiten: Ŵ_ℓ = C_ℓ^{-1/2} Û_ℓ Σ̂_ℓ V̂_ℓ^T R_ℓ^{-1/2}"]

Component 3: Loss-Aware Remapping for Hybrid Compression

The Storage-Quality Trade-off Problem

Pure low-rank compression has a hard limit: at rank $r$ , you need $r(m+n)$ parameters. To reach very aggressive compression ratios (e.g., 40% retention), you must discard many singular values, hurting quality severely.

Dobi-SVD’s partial solution: Combine SVD truncation with quantization. After SVD, write $\hat W_\ell = A_\ell D_\ell^\top$ where $A_\ell, D_\ell$ are the low-rank factors. Selectively store some rows of $A_\ell$ or $D_\ell$ in 8-bit integer instead of 16-bit float — this lets you keep more singular components (higher rank) at the same bit budget, trading some quantization error for lower truncation error. But Dobi-SVD’s row selection is structural and fixed — it doesn’t consider which rows, if quantized, would actually hurt the loss.

IO-SVD’s improvement: loss-aware row selection

After SVD truncation, for each candidate row $r_{\ell,i}$ from factor $A_\ell$ or $D_\ell$ :

Simulate int8 quantization: compute $\Delta r_{\ell,i} = Q_8(r_{\ell,i}) - r_{\ell,i}$ — the quantization error
Score by first-order loss impact (Eq. in Section 3.3): using calibration gradients $\gamma_{\ell,i} = \partial \mathcal{L}^{cal} / \partial r_{\ell,i}$ :

$s_{\ell,i} = |\langle \gamma_{\ell,i}, Q_8(r_{\ell,i}) - r_{\ell,i} \rangle|$

This is the inner product of the loss gradient and the quantization error — it estimates how much quantizing row $i$ increases the calibration loss.

Greedy selection: sort all candidate rows by score; greedily quantize rows with the smallest predicted loss impact until the remaining compression budget $C_{rem}$ is met.

Algorithm: Loss-Aware Remapping
Input:  compressed factors A_ℓ, D_ℓ; calibration gradients γ_{ℓ,i}; remaining budget C_rem
Output: hybrid A_ℓ, D_ℓ (some rows in int8)

1. For each row r_{ℓ,i} in A_ℓ ∪ D_ℓ:
     Δr = Q8(r_{ℓ,i}) - r_{ℓ,i}           # quantization error
     s_{ℓ,i} = |⟨γ_{ℓ,i}, Δr⟩|            # predicted loss impact

2. Sort rows by s_{ℓ,i} ascending (low score = safe to quantize)

3. While remaining budget not met:
     Quantize the next row (smallest s_{ℓ,i}) to int8
     Mark its storage as reduced from 2 bytes to 1 byte per element

Why does this help? Not all rows in the low-rank factors contribute equally to the output. Some rows span directions with large gradients (the loss changes a lot if you perturb them); others are in gradient-insensitive directions. By quantizing the gradient-insensitive rows first, you preserve quality while achieving the compression target.
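The scoring and greedy selection can be sketched as below. Several details are assumptions, since the review does not specify them: the quantizer is a symmetric per-row int8 round-trip with one scale per row, the storage of the scales is ignored, the budget is expressed in bytes to save, and the gradients are random placeholders.

```python
import numpy as np

def quantize_int8_rows(M):
    """Symmetric per-row int8 round-trip (assumed quantizer), returned in float."""
    scale = np.abs(M).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)        # avoid division by zero
    return np.clip(np.round(M / scale), -127, 127) * scale

def select_rows_to_quantize(factor, grad, bytes_to_save):
    """Quantize the rows with the smallest |<grad_row, Q8(row) - row>| first."""
    deq = quantize_int8_rows(factor)
    scores = np.abs(np.sum(grad * (deq - factor), axis=1))
    order = np.argsort(scores)                        # ascending: safest rows first
    bytes_per_row = factor.shape[1]                   # fp16 -> int8 saves 1 byte per element
    n_rows = min(len(order), -(-bytes_to_save // bytes_per_row))   # ceiling division
    chosen = order[:n_rows]
    hybrid = factor.copy()
    hybrid[chosen] = deq[chosen]                      # simulate int8 storage for chosen rows
    return hybrid, chosen, scores

rng = np.random.default_rng(0)
factor = rng.standard_normal((10, 16))                # rows of a low-rank factor
grad = rng.standard_normal((10, 16))                  # placeholder calibration gradients
hybrid, chosen, scores = select_rows_to_quantize(factor, grad, bytes_to_save=48)
untouched = np.setdiff1d(np.arange(10), chosen)
print(sorted(chosen.tolist()))
```

Because the score is a first-order estimate, it can be near zero for a row whose quantization error happens to be orthogonal to the gradient even if that error is large; that is a property of the scoring rule itself, not of this sketch.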

Boundary condition: At very aggressive compression (> 50% pruning), even loss-aware remapping may not recover full quality. The authors note that at 60% pruning they switch to HQ (Half-prune + Quantization): first SVD at twice the target rate, then quantize to 8-bit to reach the final budget. This two-stage approach keeps more singular information while using quantization to squeeze down.

System Architecture Overview

flowchart LR
    subgraph Input
        W["Dense Weight W_ℓ ∈ ℝ^{m×n}"]
        D["Calibration data D_cal"]
    end

    subgraph Statistics["Step 1: Compute Statistics (online, via hooks)"]
        R["R_ℓ = E[xx^T]\n(input activation covariance)"]
        C_mat["C_ℓ = E[J^T H J]\n(KL output curvature,\ntop-K approximation)"]
        G["G_ℓ = ∂L_cal/∂W_ℓ\n(gradient for scoring)"]
    end

    subgraph Whitening["Step 2: Double-Sided Whitening"]
        B["B_ℓ = C_ℓ^{1/2} W_ℓ R_ℓ^{1/2}"]
        SVD_B["SVD(B_ℓ) = U_ℓ Σ_ℓ V_ℓ^T"]
    end

    subgraph Allocation["Step 3: Rank Allocation (global)"]
        Score["I_{ℓ,i} = |g_{ℓ,i} σ_{ℓ,i}|"]
        Heap["Min-heap Q of tail components"]
        Greedy["Greedy removal until budget met"]
    end

    subgraph Unwhiten["Step 4: Reconstruct"]
        What["Ŵ_ℓ = C_ℓ^{-1/2} Û Σ̂ V̂^T R_ℓ^{-1/2}"]
    end

    subgraph Remap["Step 5: Loss-Aware Remapping (optional)"]
        Factors["A_ℓ D_ℓ^T = Ŵ_ℓ"]
        Quant["Score rows by |⟨γ,Δr⟩|; int8 lowest-impact rows"]
        Hybrid["Hybrid A_ℓ (mixed fp16/int8)"]
    end

    W --> Statistics
    D --> Statistics
    Statistics --> Whitening
    Whitening --> Allocation
    Allocation --> Unwhiten
    Unwhiten --> Remap

Experiments and Results

Setup

Models: LLaMA-7B, LLaMA-13B, LLaMA-2-7B, OPT-6.7B, Vicuna-7B, LLaVA-1.5 7B/13B, SmolVLM 2B
Calibration: 256 randomly sampled WikiText2 sequences, length 2048
Compression targets: attention Q/K/V/O projections and MLP layers
Baselines: ASVD, SVD-LLM, Dobi-SVD, ZS-SVD
Evaluation: PPL on WikiText2/PTB/C4; zero-shot accuracy on OpenBookQA, ARC-Easy/Challenge, WinoGrande, HellaSwag, PIQA, MathQA

Figure: LLaMA-7B Perplexity vs. Compression Ratio

LLaMA-7B WikiText2 PPL (lower = better)
Maintenance ratio:   0.8     0.6     0.4
─────────────────────────────────────────
Baseline (FP16):     5.68    5.68    5.68
─────────────────────────────────────────
ASVD:               11.14   1407     57057
SVD-LLM:             7.94   13.11   53.74
Dobi-SVD:            8.54   13.54   46.18
ZS-SVD:              6.74   11.44   45.17
IO-SVD (ours):       6.41    9.84   27.70
─────────────────────────────────────────
  + remapping:
Dobi-SVD∗:           6.08    8.12    9.95
ZS-SVD∗:             5.90    6.96    6.73
IO-SVD‡:             5.59    6.27    6.41
─────────────────────────────────────────

At 80% retention: IO-SVD (6.41) beats SVD-LLM (7.94) by 1.53 PPL. At 60%: 9.84 vs. 13.11. At 40%: 27.70 vs. 53.74, a nearly 2× improvement in avoiding perplexity explosion. With loss-aware remapping, IO-SVD‡ achieves 5.59 at 80% retention, which is within 0.1 PPL of the uncompressed model (5.68) and, as reported, slightly below it.

Key observation: ASVD completely collapses at 60% and 40% retention (PPL > 1000 at 60%), demonstrating why activation-aware methods matter. SVD-LLM also degrades severely at 40%. IO-SVD’s double-sided whitening and heterogeneous allocation provide far more graceful quality degradation.

Ablation: What Contributes What?

Table 4 in the paper ablates the individual components:

Whitening type              Het. Rank   PPL (0.8)   PPL (0.6)   PPL (0.4)
─────────────────────────────────────────────────────────────────────────
Input-only (SVD-LLM)        No          7.95        13.11       53.74
Double-sided (OBD-LLM)      No          7.36        11.34       32.95
Double-sided (IO-SVD)       No          7.31        11.20       32.09
─────────────────────────────────────────────────────────────────────────
Input-only (SVD-LLM)        Yes         6.72        11.65       62.76
Double-sided (OBD-LLM)      Yes         6.45         9.90       28.19
Double-sided (IO-SVD)       Yes         6.41         9.84       27.70
─────────────────────────────────────────────────────────────────────────

Findings:

Heterogeneous rank allocation alone gives a large gain: 7.95 → 6.72 PPL (at 0.8 ratio) just from adaptive rank allocation with SVD-LLM whitening
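Double-sided whitening improves further, and its gain grows with compression: roughly 0.3 to 0.6 PPL at 0.8 retention, about 1.8 to 1.9 PPL at 0.6, and more than 20 PPL at 0.4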
KL vs. Kronecker curvature for the output metric (IO-SVD vs. OBD-LLM): small additional gain, most visible at aggressive compression

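This decomposition shows that at 0.8 retention heterogeneous rank allocation is the larger contribution (7.95 → 6.72), with double-sided whitening as a complementary improvement. At 0.4 retention the picture reverses: double-sided whitening accounts for most of the gain, and heterogeneous allocation on top of input-only whitening actually raises PPL (53.74 → 62.76).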

VLM Compression Results

For vision-language models (LLaVA-1.5 7B, 13B and SmolVLM 2B), IO-SVD is evaluated on ScienceQA-IMG and SEED-Bench. It applies compression only to Q/K/V attention projections (consistent with VLM-specific methods QSVD and WSVD).

At 70% retention on LLaVA-1.5 7B:

ASVD: 50.12
SVD-LLM: 63.71
IO-SVD: 68.07 (best)

For SmolVLM 2B at 80% retention:

ASVD: 3.82%, SVD-LLM: 17.20%, IO-SVD: 82.65% on ScienceQA-IMG

The SmolVLM gain is dramatic, plausibly because smaller models have less redundancy, which makes activation-aware compression even more critical.

Cross-Architecture Generalization

Table 3 evaluates OPT-6.7B, Vicuna-7B, and LLaMA-13B at 20% pruning:

Model          Baseline PPL    ZS-SVD PPL    IO-SVD PPL
OPT-6.7B          10.86           11.40         11.10
Vicuna-7B          6.78            8.08          7.36
LLaMA-13B          5.09            5.84          5.60

IO-SVD consistently beats ZS-SVD across all three models, including the non-LLaMA OPT-6.7B, which suggests that the method is not overfitted to the LLaMA family.

Inference Speed and Memory

Experiments on a single NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB VRAM), compressing LLaMA-2-7B at batch 64, sequence length 1024+1024:

Configuration                     Throughput    Speedup   Peak GPU Memory
─────────────────────────────────────────────────────────────────────────
Dense baseline                     470 tok/s    1.00×     77.6 GB
IO-SVD (no cache opt)              483 tok/s    1.03×     70.4 GB
IO-SVD + V-cache compression      1392 tok/s    2.96×     50.3 GB
IO-SVD + V+KV cache compression   2043 tok/s    4.34×     23.1 GB
─────────────────────────────────────────────────────────────────────────

KV cache compression insight: When the key and value projections are compressed to rank $r$ , each compressed projection is $\hat W_{K/V} = A D^\top$ . Instead of caching the full $h_t = \hat W_{K/V} x_t$ , cache only the low-dimensional latent $z_t = D^\top x_t \in \mathbb{R}^r$ and reconstruct $h_t = A z_t$ on the fly. This reduces the KV cache from $2dT$ to $2rT$ entries per layer, where $d$ is the full key/value width across all heads, which is a large saving when $r \ll d$ .

The weight memory drops from 12.6 GB (dense) to 5.4 GB (compressed), but the dominant saving is in KV cache: 64.0 GB → 11.6 GB with V+KV compression, enabling 4.34× throughput on memory-bandwidth-limited decode.
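The latent-caching idea is just associativity of matrix products, as the following sketch shows. The sizes are toy values, `d` denotes the output width of the projection, and the factor names `A` and `D` follow the notation above; nothing here models positional encodings or attention itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, T = 64, 8, 10                    # projection width, rank, cached tokens (toy sizes)
A = rng.standard_normal((d, r))        # left factor of the compressed projection
D = rng.standard_normal((d, r))        # right factor, so W_hat = A @ D.T
X = rng.standard_normal((T, d))        # one hidden-state row per cached token

full_cache = X @ (A @ D.T).T           # T x d: what a standard cache would store
latent_cache = X @ D                   # T x r: z_t = D^T x_t, stored instead
reconstructed = latent_cache @ A.T     # h_t = A z_t, rebuilt at attention time

ratio = full_cache.size / latent_cache.size
print(np.allclose(full_cache, reconstructed), ratio)
```

The price is one extra `r × d` matrix multiply per cached token at read time, which is usually a good trade when decoding is limited by memory bandwidth.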

flowchart LR
    subgraph Dense["Dense Baseline"]
        KVD["KV Cache: 64.0 GB\n(full d_head × T × L)"]
        WD["Weights: 12.6 GB"]
        TD["Total: 77.6 GB | 470 tok/s"]
    end

    subgraph CompW["IO-SVD Weights Only"]
        WC["Weights: 5.4 GB\n(low-rank factors AB^T)"]
        KVC0["KV Cache: 64.0 GB\n(unchanged)"]
        TC["Total: 70.4 GB | 483 tok/s (1.03×)"]
    end

    subgraph CompV["+ V-Cache Compression"]
        WC2["Weights: 5.4 GB"]
        KVC1["V-Cache: 19.0 GB (latent z_t = D^T x_t)\nK-Cache: 19.8 GB"]
        TC2["Total: 50.3 GB | 1392 tok/s (2.96×)"]
    end

    subgraph CompKV["+ KV-Cache Compression"]
        WC3["Weights: 5.4 GB"]
        KVC2["KV-Cache: 11.6 GB\n(both K and V latents)"]
        TC3["Total: 23.1 GB | 2043 tok/s (4.34×)"]
    end

Limitations and Boundary Conditions

IO-SVD has three main limitations acknowledged by the authors:

Top-K truncation for curvature: Restricting $H_t$ to the top- $K$ tokens may miss sensitivity from the long tail of the vocabulary. For tasks requiring rare token prediction (code generation, specialized domains), this approximation could be worse.
Greedy rank allocation: The greedy algorithm is $O(P \log L)$ where $P$ is total parameters and $L$ is number of layers, but it cannot backtrack. A globally suboptimal allocation might result if component sensitivities interact (e.g., dropping component $i$ changes the importance of component $j$ ). Second-order corrections could help but would require recomputing scores after each drop.
Scale: Evaluated up to 13B parameters. For 70B+ models, the calibration-time cost of accumulating $C_\ell$ (backward passes) and $R_\ell$ (forward passes) over 256 sequences may become non-trivial, and the number of singular components the allocator must score and rank grows with both depth and hidden width.

When does IO-SVD underperform? At very low compression (< 10% pruning), SVD methods should largely converge, since any choice of whitening preserves most of the spectrum. The reported gains of double-sided whitening are at 20% pruning and beyond; below that, simpler one-sided methods are likely adequate.

What about LoRA fine-tuning after compression? The authors mention LoRA residual recovery (à la SVD-LLM) as a future direction. IO-SVD without fine-tuning already surpasses SVD-LLM with LoRA recovery at aggressive ratios, suggesting the better initialization (from double-sided whitening) provides a stronger starting point.

Comparison with Prior SVD Methods

graph LR
    subgraph Prior["Prior Art Landscape"]
        FWSVD["FWSVD (2022)\nFisher-weighted SVD\nWeight-space recon."]
        ASVD["ASVD (2023)\nActivation-aware\n(diagonal input scaling)"]
        SVDLLM["SVD-LLM (ICLR 2025)\nCholesky whitening\n(one-sided, input covariance)\n+ LoRA recovery"]
        SVDLLM2["SVD-LLM v2 (NAACL 2025)\nHeterogeneous ranks\n(truncation loss estimate)"]
        DOBI["Dobi-SVD (ICLR 2025)\nGradient optimization\n+ quantization remapping\n(structural row selection)"]
        ZS["ZS-SVD (2026)\nOne-sided, minimize\nactivation recon while\nkeeping Δloss ≈ 0"]
        OBD["OBD-LLM (2026)\nKronecker-factored\ndouble-sided whitening"]
    end

    subgraph IOSVD["IO-SVD (2026)"]
        KL["KL-aware double-sided\nwhitening (novel output metric)"]
        HRA["Greedy heterogeneous\nrank allocation"]
        LAR["Loss-aware row\nquantization remapping"]
    end

    FWSVD --> ASVD --> SVDLLM --> SVDLLM2 --> IOSVD
    DOBI --> IOSVD
    OBD --> IOSVD

Method	Whitening	Rank Allocation	Remapping
FWSVD	Weight-space (Fisher)	Homogeneous	No
ASVD	One-sided (diagonal)	Homogeneous	No
SVD-LLM	One-sided (Cholesky)	Homogeneous	No
SVD-LLM v2	One-sided (Cholesky)	Heterogeneous	No
Dobi-SVD	None (gradient opt.)	Gradient-based	Structural
ZS-SVD	One-sided	Loss-constrained	Structural
OBD-LLM	Kronecker (two-sided)	Homogeneous	No
IO-SVD	KL-aware two-sided	Greedy (scored)	Loss-aware

Deep Dive: The Math Behind Efficient Curvature Computation

The most technically demanding part of IO-SVD is accumulating $C_\ell = \mathbb{E}_t[J_t^\top H_t J_t]$ without materializing any $V$ -dimensional objects. Let me trace through the derivation in Appendix C step by step.

Setting Up the Problem

For a target layer $\ell$ , let $h_{\ell,t} \in \mathbb{R}^{d_{out}}$ be its output at token $t$ . The top- $K$ restricted curvature is:

$C_{\text{token},t}^{(\ell)} \approx J_{t,K}^{(\ell)\top} H_{t,K} J_{t,K}^{(\ell)}, \quad J_{t,K}^{(\ell)} = \frac{\partial z_{t,K}}{\partial h_{\ell,t}} \in \mathbb{R}^{K \times d_{out}}$

where $z_{t,K} \in \mathbb{R}^K$ are the logits restricted to the top- $K$ support.

Factoring $H_{t,K}$

Define: $s_t = \sqrt{p_{t,K}} \in \mathbb{R}^K, \quad D_t = \text{Diag}(s_t) \in \mathbb{R}^{K \times K}, \quad \Omega_t = I_K - s_t s_t^\top \in \mathbb{R}^{K \times K}$

Note that since $p_{t,K}$ is renormalized, $\|s_t\|^2 = 1$ , so $\Omega_t$ is a projector: $\Omega_t^2 = \Omega_t$ .

Define $A_t = D_t \Omega_t \in \mathbb{R}^{K \times K}$ . Then:

$A_t A_t^\top = D_t \Omega_t \Omega_t^\top D_t = D_t \Omega_t D_t = \text{Diag}(p_{t,K}) - p_{t,K} p_{t,K}^\top = H_{t,K}$

So $H_{t,K} = A_t A_t^\top$ with $A_t \in \mathbb{R}^{K \times K}$ .
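The factorization is easy to confirm numerically. A minimal check with a random renormalized distribution over $K = 6$ tokens (an arbitrary illustrative size):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 6
p = rng.random(K)
p /= p.sum()                           # renormalized top-K probabilities

s = np.sqrt(p)                         # unit-norm vector because sum(p) = 1
D = np.diag(s)
Omega = np.eye(K) - np.outer(s, s)     # projector onto the complement of s
A = D @ Omega
H = np.diag(p) - np.outer(p, p)

is_projector = np.allclose(Omega @ Omega, Omega)
factorizes = np.allclose(A @ A.T, H)
print(is_projector, factorizes)
```

Note that the order matters: it is $A_t A_t^\top$ that reproduces $H_{t,K}$ , while $A_t^\top A_t$ does not unless $p_{t,K}$ is uniform.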

Converting to VJP Accumulation

$C_{\text{token},t}^{(\ell)} \approx J_{t,K}^{\top} A_t A_t^\top J_{t,K} = \left(A_t^\top J_{t,K}\right)^\top \left(A_t^\top J_{t,K}\right) = F_t^\top F_t$

where $F_t = A_t^\top J_{t,K} \in \mathbb{R}^{K \times d_{out}}$ .

Each row $f_{t,k} = e_k^\top A_t^\top J_{t,K}$ is a vector-Jacobian product (VJP): the gradient of the scalar $e_k^\top A_t^\top z_{t,K} = (A_t e_k)^\top z_{t,K}$ with respect to $h_{\ell,t}$ . This can be computed via one backward pass seeded with the vector $A_t e_k$ , the $k$ -th column of $A_t$ .

Algorithm for accumulating $C_\ell$ :

For each calibration token t:
    1. Run forward pass; record h_{ℓ,t} and top-K support + probabilities
    2. Compute s_t = sqrt(p_{t,K}), A_t = Diag(s_t)(I - s_t s_t^T)
    3. For k = 1, ..., K:
         v_k = A_t e_k                      # K-dim vector: k-th column of A_t
         f_k = VJP(z_{t,K}, h_{ℓ,t}, v_k)  # one backward pass (hook on layer ℓ)
    4. Accumulate: C_ℓ += F_t^T F_t = Σ_k f_k f_k^T

C_ℓ /= num_tokens   # normalize

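Total cost per layer: $T \cdot K$ VJPs, which this rough count prices at $O(T \cdot K \cdot d_{out} \cdot L)$ scalar operations, where $T$ = calibration tokens and $L$ = model depth. For $T=256$, $K=64$, $d_{out}=4096$, $L=32$ that is about 2.1 billion. Treat this as a scaling rule rather than a FLOP estimate: a real backward pass through a transformer block costs far more than $O(d_{out})$ per layer, and the calibration set listed in the Reproducibility Notes (256 sequences of length 2048) contains far more than 256 tokens.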
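To make the accumulation concrete, here is a toy sketch in plain Python. It assumes the logits depend linearly on the layer output through a fixed matrix `M` (a stand-in I made up), so the Jacobian is just `M` and each "backward pass" is a vector-matrix product. In a real model the VJP would come from autograd instead.

```python
import math

def vjp_linear(M, v):
    """Return v^T M for a K x d matrix M (stands in for one backward pass)."""
    K, d = len(M), len(M[0])
    return [sum(v[k] * M[k][j] for k in range(K)) for j in range(d)]

def curvature_via_vjps(M, p):
    """Accumulate sum_k f_k f_k^T with f_k = (A e_k)^T J, where J = M."""
    K, d = len(M), len(M[0])
    s = [math.sqrt(x) for x in p]
    A = [[s[i] * ((1.0 if i == j else 0.0) - s[i] * s[j]) for j in range(K)]
         for i in range(K)]
    C = [[0.0] * d for _ in range(d)]
    for k in range(K):
        v_k = [A[i][k] for i in range(K)]   # k-th column of A
        f_k = vjp_linear(M, v_k)            # one VJP per support index
        for a in range(d):
            for b in range(d):
                C[a][b] += f_k[a] * f_k[b]
    return C

def curvature_direct(M, p):
    """Reference value J^T H J with H = Diag(p) - p p^T."""
    K, d = len(M), len(M[0])
    H = [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(K)]
         for i in range(K)]
    HM = [[sum(H[i][k] * M[k][j] for k in range(K)) for j in range(d)]
          for i in range(K)]
    return [[sum(M[k][a] * HM[k][b] for k in range(K)) for b in range(d)]
            for a in range(d)]

M = [[1.0, -2.0], [0.5, 0.0], [3.0, 1.0]]   # toy Jacobian: K = 3, d_out = 2
p = [0.6, 0.3, 0.1]                         # renormalized top-K probabilities
C_vjp = curvature_via_vjps(M, p)
C_ref = curvature_direct(M, p)
gap = max(abs(C_vjp[a][b] - C_ref[a][b]) for a in range(2) for b in range(2))
print(f"max |C_vjp - C_ref| = {gap:.2e}")
```

The point of the VJP route is that it never forms $J_{t,K}$ explicitly. The toy forms it only so that the two results can be compared.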

Why Not Full Gradient?

One might ask: why not just use the full gradient $G_\ell = \partial \mathcal{L}/\partial W_\ell$ as the curvature? The answer is that $G_\ell$ is a first-order object (gradient), while $C_\ell$ is second-order (Hessian-like). The gradient tells you which direction to move; the curvature tells you how much the loss changes when you move in each direction. For compression, we need to know which directions are “expensive” to lose — that’s curvature information, not gradient direction.

Connection to Optimal Brain Damage / GPTQ

IO-SVD’s per-component scoring $I_{\ell,i} = |g_{\ell,i} \sigma_{\ell,i}|$ is closely related to the Optimal Brain Damage (OBD) framework:

OBD (LeCun et al., 1990) scores parameters by their second-order saliency:

$\text{saliency}(\theta_i) = \frac{1}{2} H_{ii} \delta\theta_i^2$

where $H_{ii}$ is the diagonal Hessian entry. The idea: parameters with small $|H_{ii}|$ and small magnitude are safe to prune.

IO-SVD’s score $|g_{\ell,i} \sigma_{\ell,i}|$ is a first-order approximation:

$\Delta\mathcal{L} \approx g_{\ell,i} \cdot \Delta\sigma_{\ell,i} = g_{\ell,i} \cdot (-\sigma_{\ell,i})$

The magnitude $|\cdot|$ ensures we score the absolute loss impact (we don’t know if the actual change will increase or decrease loss due to the sign of $g$ , but the magnitude tells us the sensitivity scale).

Why not second-order? Second-order scoring (like GPTQ’s OBC framework) would also include the Hessian diagonal $H_{\ell,ii}$ in the score. This would capture curvature information about how the loss curves near the current parameter value, but at the cost of computing diagonal Hessian elements — which requires running a second backward pass per component. For SVD components (which are already in the whitened space where the Hessian has a simpler structure), the first-order approximation with the doubly-whitened gradient turns out to be sufficient in practice.
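The two scoring rules are simple enough to compare side by side. The sketch below uses invented gradients, singular values, and Hessian diagonals. It only illustrates the formulas, not how the paper normalizes scores across layers or enforces minimum ranks.

```python
def first_order_scores(grads, sigmas):
    """|g_i * sigma_i|: predicted loss change from zeroing each component."""
    return [abs(g * s) for g, s in zip(grads, sigmas)]

def obd_saliencies(hess_diag, sigmas):
    """0.5 * H_ii * sigma_i^2: the OBD-style second-order alternative."""
    return [0.5 * h * s * s for h, s in zip(hess_diag, sigmas)]

def removal_order(scores):
    """Indices sorted from least to most important."""
    return sorted(range(len(scores)), key=lambda i: scores[i])

sigmas = [4.0, 2.0, 1.0, 0.5]          # hypothetical singular values
grads = [0.01, -0.3, 0.05, 0.2]        # hypothetical dL/dsigma per component
hess_diag = [0.002, 0.5, 0.1, 0.8]     # hypothetical diagonal Hessian entries

scores = first_order_scores(grads, sigmas)
order = removal_order(scores)
print("first-order removal order:", order)
print("OBD-style removal order:  ", removal_order(obd_saliencies(hess_diag, sigmas)))
```

In this toy the largest singular value is removed first because its gradient is tiny, which is exactly the case where a magnitude-only criterion would go wrong.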

How IO-SVD Fits into the Post-Training Compression Landscape

graph TD
    subgraph PTQ["Post-Training Quantization (PTQ)"]
        Q1["GPTQ/OPTQ: row-wise Hessian updates\n(2nd order, expensive but accurate)"]
        Q2["SmoothQuant: activation smoothing\nfor outlier-safe quantization"]
        Q3["QuIP: incoherence processing\n(Hadamard randomization)"]
    end
    subgraph SVDComp["SVD-Based Compression"]
        S1["ASVD: one-sided (diagonal)"]
        S2["SVD-LLM: one-sided (Cholesky)"]
        S3["IO-SVD: two-sided (KL)\n+ adaptive rank + remapping"]
    end
    subgraph Pruning["Pruning"]
        P1["LLM-Pruner: neuron connectivity\n(structured, hardware-friendly)"]
        P2["SliceGPT: PCA-based slice removal\n(reduces all matrix dimensions)"]
        P3["Wanda: magnitude × activation\n(unstructured)"]
    end
    subgraph Hybrid["Hybrid Approaches"]
        H1["Dobi-SVD: SVD + quantization remapping\n(gradient-optimized)"]
        H2["IO-SVD‡: SVD + loss-aware int8\n(this paper's hybrid mode)"]
    end

    S2 --> S3
    Q1 --> H1
    H1 --> H2

Key positioning: IO-SVD sits at the intersection of SVD compression and hybrid SVD-quantization methods. It does not require specialized hardware support (unlike quantization, which needs INT8/INT4 kernels) — low-rank matrix multiplication works on standard CUDA cores. But the loss-aware remapping variant adds optional INT8 for rows with low quantization sensitivity.

When to choose SVD over quantization?

Hardware without INT4/INT8 kernel support: SVD works with standard FP16 GEMM
When you need structured parameter reduction (reducing actual matrix rank, enabling smaller KV cache)
When calibration time is limited: SVD compression with 256 samples takes minutes; full GPTQ can take hours on large models

Additional Experimental Details

LLaMA-2-7B Commonsense Reasoning

Table 5 compares IO-SVD against both structured pruning and SVD methods on LLaMA-2-7B:

Method                  PIQA    HellaS.  WinoG.  ARC-e   ARC-c   Avg
─────────────────────────────────────────────────────────────────────────
Baseline (FP16)         0.78    0.57    0.69    0.76    0.43    0.65
─────────────────────────────────────────────────────────────────────────
At 40% retention:
LLM-Pruner             0.70    0.41    0.53    0.53    0.27    0.48
SliceGPT               0.65    0.57    0.60    0.43    0.32    0.51
Bonsai                 0.72    0.45    0.58    0.59    0.30    0.53
Wanda-sp               0.70    0.42    0.53    0.57    0.29    0.50
SVD-LLM                0.56    0.30    0.57    0.39    0.21    0.41
ZS-SVD                 0.63    0.34    0.60    0.46    0.25    0.45
IO-SVD                 0.61    0.33    0.59    0.51    0.23    0.45
─────────────────────────────────────────────────────────────────────────
  + remapping:
Dobi-SVD∗              0.72    0.45    0.64    0.67    0.31    0.56
ZS-SVD∗                0.72    0.46    0.67    0.66    0.33    0.57
IO-SVD‡                0.74    0.47    0.67    0.73    0.38    0.60
─────────────────────────────────────────────────────────────────────────

Several important points:

Structured pruning beats vanilla SVD at this aggressive compression level: Bonsai averages 0.53 and LLM-Pruner 0.48, vs. 0.41 for SVD-LLM. A likely reason is that structured pruning removes entire attention heads or neurons, maintaining full-rank computation in surviving components.
IO-SVD‡ beats all methods at 40% retention: 0.60 average, surpassing even the best structured pruning baseline (Bonsai 0.53).
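The ARC-Challenge results tell the most interesting story: this is the hardest subset (requires multi-step reasoning), and IO-SVD's relative gain from remapping is largest here (0.23 → 0.38, about +65%), even though the largest absolute gain is on ARC-e (0.51 → 0.73). Across methods, the span on ARC-c runs from SVD-LLM at 0.21 to IO-SVD‡ at 0.38.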
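For readers who want the per-task effect of remapping on IO-SVD spelled out, the arithmetic on the two IO-SVD rows of the table above is short. The numbers are copied from the table; the variable names are mine.

```python
tasks = ["PIQA", "HellaS.", "WinoG.", "ARC-e", "ARC-c"]
io_svd = [0.61, 0.33, 0.59, 0.51, 0.23]         # IO-SVD at 40% retention
io_svd_remap = [0.74, 0.47, 0.67, 0.73, 0.38]   # IO-SVD‡ at 40% retention

# Absolute and relative accuracy gain per task
gains = {
    task: (after - before, after / before - 1.0)
    for task, before, after in zip(tasks, io_svd, io_svd_remap)
}
for task, (abs_gain, rel_gain) in gains.items():
    print(f"{task:8s} {abs_gain:+.2f} ({rel_gain:+.0%})")

largest_abs = max(gains, key=lambda t: gains[t][0])
largest_rel = max(gains, key=lambda t: gains[t][1])
print("largest absolute gain:", largest_abs)
print("largest relative gain:", largest_rel)
```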

Remapping Ablation (Table 6)

The remapping comparison on LLaMA-7B isolates the contribution of loss-aware row selection:

Method          Mode              Wiki↓(0.8)  C4↓(0.8)  PTB↓(0.8)  Wiki↓(0.6)  C4↓(0.6)  PTB↓(0.6)
─────────────────────────────────────────────────────────────────────────────────────────────────────
SVD-LLM         compressed        7.94        15.84      16.22       13.11       49.83      63.75
                + remap∗          5.86         7.82       8.82        6.98       11.59      12.88
                + loss-aware‡     5.66         7.78       8.71        6.69       11.39      12.46

ZS-SVD          compressed        6.74        10.74      11.87       11.44       34.13      43.19
                + remap∗          5.90         7.95       8.81        6.96       11.52      12.72
                + loss-aware‡     5.69         7.92       8.78        6.69       11.46      12.80

IO-SVD          compressed        6.41         9.82      10.93        9.84       27.15      28.84
                + remap∗          5.76         7.61       8.59        6.48       10.24      10.95
                + loss-aware‡     5.59         7.62       8.56        6.27       10.15      10.89

The key insight: standard remapping (∗) gives the large gain (e.g., IO-SVD: 6.41 → 5.76 on WikiText2 at 0.8 ratio), while loss-aware remapping (‡) gives an additional marginal improvement (5.76 → 5.59). In this table the loss-aware gain is most consistent on WikiText2, the calibration distribution (0.17 to 0.29 PPL across methods and ratios). On C4 and PTB it is smaller and not uniform: SVD-LLM gains 0.42 on PTB at 0.6, but IO-SVD on C4 at 0.8 (7.61 → 7.62) and ZS-SVD on PTB at 0.6 (12.72 → 12.80) get slightly worse. Whether loss-aware selection is more robust to distribution shift therefore remains open on this evidence.
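To see where loss-aware selection helps, one can difference the ‡ and ∗ rows of the table column by column. The values below are copied from the table; negative deltas mean loss-aware remapping lowered perplexity.

```python
columns = ["Wiki@0.8", "C4@0.8", "PTB@0.8", "Wiki@0.6", "C4@0.6", "PTB@0.6"]
remap = {
    "SVD-LLM": [5.86, 7.82, 8.82, 6.98, 11.59, 12.88],
    "ZS-SVD": [5.90, 7.95, 8.81, 6.96, 11.52, 12.72],
    "IO-SVD": [5.76, 7.61, 8.59, 6.48, 10.24, 10.95],
}
loss_aware = {
    "SVD-LLM": [5.66, 7.78, 8.71, 6.69, 11.39, 12.46],
    "ZS-SVD": [5.69, 7.92, 8.78, 6.69, 11.46, 12.80],
    "IO-SVD": [5.59, 7.62, 8.56, 6.27, 10.15, 10.89],
}

# Perplexity change from switching standard remapping to loss-aware remapping
delta = {
    method: [round(a - b, 2) for a, b in zip(loss_aware[method], remap[method])]
    for method in remap
}
for method, row in delta.items():
    cells = "  ".join(f"{name}: {d:+.2f}" for name, d in zip(columns, row))
    print(f"{method:8s} {cells}")
```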

Theoretical Connections: Why Doubly-Whitened SVD Approximates the Optimal

The Eckart–Young–Mirsky theorem gives the optimal low-rank approximation under the Frobenius norm. IO-SVD reduces the problem to:

$\min_{\hat W_\ell : \text{rank}(\hat W_\ell) \leq r} \left\| C_\ell^{1/2} (W_\ell - \hat W_\ell) R_\ell^{1/2} \right\|_F^2$

This is equivalent (by substitution $B = C_\ell^{1/2} W_\ell R_\ell^{1/2}$ , $\hat B = C_\ell^{1/2} \hat W_\ell R_\ell^{1/2}$ ) to:

$\min_{\hat B : \text{rank}(\hat B) \leq r} \|B - \hat B\|_F^2$

The solution is the rank- $r$ truncated SVD of $B = C_\ell^{1/2} W_\ell R_\ell^{1/2}$ , which is globally optimal under this objective.
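A tiny numerical illustration of why the whitened truncation wins under its own objective. This is a sketch under strong simplifying assumptions: $C_\ell$ and $R_\ell$ are diagonal with made-up entries, the target rank is 1, and the leading singular triplet is found by power iteration so the example needs only the standard library.

```python
import math

def top_singular_triplet(B, iters=200):
    """Leading (sigma, u, v) of B via power iteration on B^T B."""
    m, n = len(B), len(B[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        Bv = [sum(B[i][j] * v[j] for j in range(n)) for i in range(m)]
        w = [sum(B[i][j] * Bv[i] for i in range(m)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Bv = [sum(B[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = math.sqrt(sum(x * x for x in Bv))
    return sigma, [x / sigma for x in Bv], v

def rank_one(B):
    """Best rank-1 approximation sigma * u v^T."""
    sigma, u, v = top_singular_triplet(B)
    return [[sigma * ui * vj for vj in v] for ui in u]

def weighted_error(W, W_hat, c, r):
    """||C^{1/2} (W - W_hat) R^{1/2}||_F^2 for C = diag(c), R = diag(r)."""
    return sum(c[i] * r[j] * (W[i][j] - W_hat[i][j]) ** 2
               for i in range(len(W)) for j in range(len(W[0])))

W = [[3.0, 1.0, 0.0],
     [1.0, 1.0, 0.5]]
c = [0.01, 4.0]        # hypothetical output-side curvature (diagonal)
r = [1.0, 1.0, 2.0]    # hypothetical input-side second moments (diagonal)

# Plain truncation: rank-1 SVD of W itself.
W_plain = rank_one(W)

# Whitened truncation: rank-1 SVD of B = C^{1/2} W R^{1/2}, then map back.
B = [[math.sqrt(c[i]) * W[i][j] * math.sqrt(r[j]) for j in range(3)]
     for i in range(2)]
B_hat = rank_one(B)
W_white = [[B_hat[i][j] / (math.sqrt(c[i]) * math.sqrt(r[j])) for j in range(3)]
           for i in range(2)]

err_plain = weighted_error(W, W_plain, c, r)
err_white = weighted_error(W, W_white, c, r)
print(f"weighted error: plain={err_plain:.4f}, whitened={err_white:.4f}")
```

The first output row has the larger weights but almost no curvature in this toy, so plain SVD spends its single rank on a direction the loss barely cares about.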

What the objective approximates: Under the moment-decoupling and Taylor approximations:

$\Delta J_\ell \approx \frac{1}{2} \left\| C_\ell^{1/2} (W_\ell - \hat W_\ell) R_\ell^{1/2} \right\|_F^2$

So minimizing this Frobenius norm is equivalent to minimizing the (approximate) layerwise KL divergence increase. In other words, IO-SVD minimizes a second-order approximation to the compression-induced KL divergence, which gives the layerwise objective an information-theoretic grounding rather than a purely reconstruction-based one.

The approximations involved are:

Layerwise independence (compress each layer independently, ignoring cross-layer effects)
Second-order Taylor (ignores terms $O(\|\delta z\|^3)$ )
Moment decoupling ( $\mathbb{E}[x_t x_t^\top \cdot C_{\text{token},t}] \approx \mathbb{E}[x_t x_t^\top] \cdot \mathbb{E}[C_{\text{token},t}]$ )
Top-K vocabulary restriction (ignores long tail of $p_t$ )

Each approximation is common in the literature. The combination makes the method practical while retaining most of the theoretical grounding.

Relationship to LoRA and Fine-Tuning Recovery

One natural question: does IO-SVD’s better initialization from double-sided whitening translate into better outcomes when combined with LoRA fine-tuning recovery?

SVD-LLM introduced a “LoRA recovery” step: after SVD compression, add a low-rank adapter $\Delta W = A B^\top$ and fine-tune it on a small dataset to recover quality. The idea is that the compressed model provides a good starting point, and the adapter fills in the residual error.

The starting point quality hypothesis: If the compressed model is already closer to the original (lower KL divergence), then:

The residual error $W - \hat W$ is smaller in magnitude and more distributed across less sensitive directions
LoRA needs fewer steps and less capacity to recover the same quality
Final quality (compressed + adapter) should be higher

IO-SVD’s Table 1 results at 80% retention (5.59 PPL with loss-aware remapping, vs. SVD-LLM’s ~5.66 with LoRA recovery) suggest that IO-SVD without fine-tuning can match SVD-LLM with fine-tuning. This is a significant practical advantage: LoRA recovery needs a training corpus and gradient-based weight updates, while IO-SVD needs only forward and backward passes over a small calibration set.

What if you combine IO-SVD with LoRA? The paper doesn’t explore this, but one would expect the combination to achieve the best results. Starting from a better initialization (IO-SVD) and then applying a small LoRA adapter should outperform both alternatives. The interesting research question is: at what compression ratio does the LoRA adapter stop helping? Intuitively, if too many singular values are removed, no amount of low-rank residual tuning can recover the lost expressivity.

A note on the search space: IO-SVD produces $\hat W_\ell = C_\ell^{-1/2} U_r \Sigma_r V_r^\top R_\ell^{-1/2}$, which is explicitly low-rank (rank $r$). Adding a LoRA adapter $\Delta W = AB^\top$ of rank $r'$ gives an approximation of rank at most $r+r'$. The total parameter count is $(r + r')(m+n)$. Choosing $r$ and $r'$ jointly would allow optimal allocation between the compressed backbone and the adapter.
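The parameter accounting is worth writing down, since an adapter eats into the same budget as the backbone. A minimal sketch, using a square $4096 \times 4096$ matrix and a rank-64 adapter as purely illustrative choices:

```python
def retention(m, n, rank, adapter_rank=0):
    """Parameters of the factored form (backbone plus adapter) relative to dense m x n."""
    return (rank + adapter_rank) * (m + n) / (m * n)

def break_even_rank(m, n):
    """Rank at which the factored form stores as many parameters as the dense matrix."""
    return m * n / (m + n)

m = n = 4096
backbone_only = retention(m, n, 1638)
with_adapter = retention(m, n, 1638, adapter_rank=64)
print(f"backbone only:      {backbone_only:.4f}")
print(f"backbone + adapter: {with_adapter:.4f}")
print(f"break-even rank:    {break_even_rank(m, n):.0f}")
```

At a fixed budget, every unit of adapter rank $r'$ costs exactly as much as a unit of backbone rank $r$, so the joint choice is a straight trade between the two.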

Relationship to KV Cache Management in Modern LLM Serving

IO-SVD’s KV cache compression idea connects to a broader trend in LLM serving:

graph LR
    subgraph KVCacheApproaches["KV Cache Size Reduction Approaches"]
        MQA["Multi-Query Attention (MQA)\nShare K/V across heads\n(1 head for K/V, H for Q)"]
        GQA["Grouped Query Attention (GQA)\nShare K/V in groups\n(H/G heads per K/V group)"]
        MLA["Multi-head Latent Attention (MLA)\nProject K/V to low-dim latent z\nthen expand with shared U_K, U_V"]
        IOSVD["IO-SVD KV Compression\nCache low-dim z_t = D^T x_t\n(r << d_head, SVD-derived post-training)"]
    end

    MQA -->|"generalized to"| GQA
    GQA -->|"further: low-rank"| MLA
    MLA -->|"post-training"| IOSVD

MQA/GQA reduce KV cache by sharing K/V heads across groups of query heads, a choice baked in at training time. MLA (introduced in DeepSeek-V2 and kept in DeepSeek-V3) goes further with learned low-rank projections, also at training time. IO-SVD’s KV cache compression achieves a similar effect post-training: by caching only the low-rank latent $z_t = D^\top x_t$ instead of the full key/value, it effectively “converts” a dense K/V projection to a latent attention mechanism without any retraining.

The key difference from MLA is that IO-SVD derives the compression matrices ( $D$ , $A$ ) from the existing weight matrices via SVD + whitening, rather than learning them end-to-end. This makes it applicable to any pre-trained model without modifying the training recipe.
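Mechanically, the latent cache looks like this. The sketch assumes a single projection factored as $\hat W = A D^\top$ with tiny made-up matrices, and it ignores per-head structure and positional encodings, which a real implementation would have to handle.

```python
def matvec(M, x):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

# Hypothetical factors of a key projection: W_hat = A D^T, rank r = 2
D = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]   # d_in x r
A = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0], [0.0, -1.0]]   # d_out x r
x_t = [1.0, 2.0, 3.0, 4.0]                              # one token's hidden state

# Prefill/decode: store only the r-dimensional latent for this token.
z_t = matvec(transpose(D), x_t)

# Attention time: expand the latent back to a full key.
k_from_latent = matvec(A, z_t)

# Reference: apply the dense low-rank weight directly.
W_hat = [[sum(A[i][c] * D[j][c] for c in range(2)) for j in range(4)]
         for i in range(4)]
k_dense = matvec(W_hat, x_t)

cache_ratio = len(z_t) / len(k_dense)
print(z_t, k_from_latent, k_dense, cache_ratio)
```

The cache holds `len(z_t)` numbers per token instead of `len(k_dense)`, and the expansion by `A` is paid at attention time.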

Reproducibility Notes

Code: https://github.com/mint-vu/IO-SVD
Calibration: 256 WikiText2 sequences, length 2048 (easy to reproduce)
Hardware used: NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB) — not yet commodity hardware, but A100/H100 80 GB should work for 7-13B models
Key hyperparameters: $\lambda_R, \lambda_C$ (damping constants), $K$ (top-K for KL curvature), $\eta$ (min rank ratio)
Top-K selection: optimal $K$ determined by sweep on WikiText2 validation set; generalizes to PTB/C4
The moment-decoupling approximation (treating $x_t$ and $C_{\text{token},t}$ as independent) may not hold when the model attends differently across very diverse inputs — worth checking on domain-specific calibration data

Personal Analysis

IO-SVD represents a clean synthesis of ideas that were floating around the SVD compression literature: KL-aware objectives (from information geometry), Kronecker-factored curvature (from K-FAC), and greedy component removal (from optimal brain damage). The key novelty is the efficient computation of the output-side curvature $C_\ell$ via top-K VJPs — this avoids the vocabulary-size bottleneck that would otherwise make double-sided whitening impractical.

The most interesting result to me is the heterogeneous rank allocation being the dominant contributor (Table 4). This suggests that future work could focus on even better rank allocation strategies — perhaps second-order corrections or learned allocation policies — while the whitening objective is already “good enough.” The paper’s ablation is methodologically sound in isolating these contributions.

One tension I notice: the paper evaluates calibration on WikiText2 (in-distribution for perplexity), and the top-K curvature selection is also tuned on WikiText2. For deployment on specialized domains (medical, code, legal), the calibration distribution mismatch could be significant. A natural extension would be to study how different calibration sets affect the quality of $C_\ell$ and the resulting allocation.

The KV-cache compression as a byproduct (Section 4.2.1) is underemphasized in my opinion. Achieving 4.34× throughput and dropping from 77.6 GB to 23.1 GB peak memory with minimal quality loss is the kind of result that actually enables deployment on mid-tier hardware. This deserves a dedicated experiment varying sequence length and batch size to characterize the memory-bandwidth tradeoff more fully.

Overall, IO-SVD is a solid step toward principled, information-geometry-aware LLM compression, and the combination of all three components (double-sided whitening + heterogeneous rank allocation + loss-aware remapping) sets a strong new baseline for the field.

Comparison with MLA-style latent attention: DeepSeek V3’s Multi-head Latent Attention (MLA) also uses low-rank KV projections, but as a training-time architectural choice. IO-SVD achieves a similar effect post-training, demonstrating that the “low-rank KV” idea is not just architecturally motivated but can also be retrofitted. This convergence of ideas from different angles (training-time MLA vs. post-training SVD) suggests that low-rank KV representations are a robust and general principle.

On reproducibility at scale: The calibration process accumulates two matrices ($R_\ell$ and $C_\ell$) per weight matrix. For a 7B model with 32 layers × 7 weight matrices each, that’s 224 weight matrices. At $d_{out}=4096$, each $C_\ell$ has $4096 \times 4096 \approx 16.8$M entries, i.e. 128 MB at 8 bytes per entry (64 MB in FP32), for a total of ~28 GB of curvature matrices. For 70B models ($d=8192$, 80 layers, 7 matrices each), the same estimate gives 560 matrices of 512 MB each, about 280 GB, and the wider MLP projections push the true figure higher. That far exceeds single-GPU memory, so practical deployment at scale would require curvature matrix compression, block-diagonal approximations, or streaming estimation, all active research directions in second-order optimization.
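The footprint can be tallied in a few lines. This is a back-of-the-envelope sketch that assumes every weight matrix has a square $d \times d$ curvature matrix and that entries are stored at 8 bytes each. Both are simplifications of mine, and real MLP projections are wider.

```python
GIB = 1024 ** 3

def curvature_bytes(d_out, num_matrices, bytes_per_entry):
    """Memory for one d_out x d_out curvature matrix per weight matrix."""
    return d_out * d_out * num_matrices * bytes_per_entry

# 7B-class model: 32 layers x 7 weight matrices, d_out = 4096
seven_b_gib = curvature_bytes(4096, 32 * 7, 8) / GIB

# 70B-class model: 80 layers x 7 weight matrices, d = 8192
seventy_b_gib = curvature_bytes(8192, 80 * 7, 8) / GIB

print(f"7B:  {seven_b_gib:.0f} GiB")
print(f"70B: {seventy_b_gib:.0f} GiB")
```

Halving the entry size to FP32 halves both numbers, which is the cheapest lever before reaching for block-diagonal or streaming approximations.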

A thought experiment: What if IO-SVD were applied not to all linear layers uniformly, but selectively — compressing only the layers identified as least sensitive? Combined with a heterogeneous rank allocation that leaves some layers fully dense, this “sparse SVD” approach might recover even more quality at the same storage budget. The current framework already supports this (layers at or above their break-even rank are kept dense), but the question of which layers to exclude entirely deserves explicit study.

Final verdict: IO-SVD represents the current best practice in post-training SVD-based LLM compression. For practitioners: if you need to deploy a 7B model on hardware where the uncompressed model barely fits, IO-SVD + KV-cache compression can give you 3-4× throughput and substantially lower memory footprint with sub-1 PPL quality loss at 80% retention — a compelling practical trade-off.

Open Questions and Future Directions

Several threads from this work deserve follow-up:

Second-order rank allocation: The greedy first-order score $|g_{\ell,i}\sigma_{\ell,i}|$ works well but is a proxy for the true loss impact. Including the diagonal Hessian entry $h_{\ell,ii}$ (analogous to OBD) could improve allocation accuracy, especially at aggressive compression ratios where first-order approximations degrade.
Domain-adaptive calibration: All experiments use WikiText2 for calibration. The top-K curvature approximation is tuned on this domain. For specialized deployment (medical, legal, code), calibration with domain-specific data would better characterize $C_\ell$ , potentially yielding better rank allocation for in-domain tasks.
Joint architecture search: IO-SVD currently compresses each layer independently after the model is trained. A natural extension is to train with SVD-structured weights from the start, jointly learning the whitening matrices and rank distribution via gradient descent — analogous to how MLA is trained end-to-end.
Multi-GPU disaggregated KV: For serving, low-rank KV latents are even more attractive in disaggregated architectures where KV caches are stored remotely (like Mooncake’s transfer engine). The smaller latent size $r \ll d_{head}$ reduces network transfer bandwidth between prefill and decode workers.
Extension to convolution and SSM layers: The doubly-whitened SVD framework applies to any linear map. State-space models (Mamba, RWKV) have their own analog of weight matrices — adapting IO-SVD’s whitening to their recurrence structure would be a natural generalization.
Quantization-aware joint optimization: Currently IO-SVD performs SVD truncation first, then remapping. A joint formulation that simultaneously decides rank and quantization targets (e.g., a mixed-integer program over {fp16, int8, int4} precision for each singular component) might find better Pareto points on the quality-compression curve.
Online/adaptive compression: For long-context workloads where the token distribution changes significantly over the sequence, a dynamic rank adaptation strategy — increasing rank for layers that become more sensitive as context grows — could yield better quality than a fixed static allocation derived from short calibration sequences.