- Review date: 2026-05-29
- Review author: Zhongzhu Zhou
- Paper reviewed: IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression
- Paper authors: Ali Abbasi, Chayne Thrash, Haoran Qin, Hamed Pirsiavash, Soheil Kolouri (Vanderbilt University, UC Davis)
- arXiv: 2605.15626
- Status / Venue: arXiv preprint, May 2026
Short Answer
IO-SVD improves SVD-based post-training LLM compression by replacing one-sided activation whitening with a KL-aware double-sided whitening objective, pairing it with greedy heterogeneous rank allocation scored by first-order calibration loss, and adding loss-aware quantization remapping for hybrid SVD–int8 compression. Together these three ideas bring WikiText2 perplexity on LLaMA-7B from 7.94 (SVD-LLM) down to 5.59 at 80% parameter retention, and from 13.11 to 6.27 at 60% retention, where ASVD has already collapsed to a perplexity above 1,000.
Prerequisites
Before diving in, let me lay out the background knowledge you need to follow the mathematics and engineering decisions in this paper.
Singular Value Decomposition (SVD)
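Any real matrix $W \in \mathbb{R}^{m \times n}$ can be written as
$$W = U \Sigma V^\top$$
where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices, and $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal with non-negative entries $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$ called singular values.
The Eckart–Young–Mirsky theorem guarantees that the best rank-$r$ approximation (in Frobenius or spectral norm) to $W$ is
$$W_r = \sum_{i=1}^{r} \sigma_i u_i v_i^\top$$
i.e., just keep the top $r$ singular values and their singular vectors. The approximation error is $\sum_{i > r} \sigma_i^2$ in squared Frobenius norm.
In LLM compression, we apply this idea to every linear weight matrix: replace each $W$ with a product $AB$ where $A \in \mathbb{R}^{m \times r}$, $B \in \mathbb{R}^{r \times n}$, $r < \min(m, n)$. If the compression ratio is $\rho = r(m+n)/(mn)$, then at $\rho = 0.8$ we’re keeping 80% of the parameters.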
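To make the truncation step concrete, here is a minimal NumPy sketch. The matrix sizes and the helper name `truncated_svd` are illustrative assumptions, not anything from the paper; the point is only to check the Eckart–Young error formula and the retained-parameter ratio on a random matrix.

```python
import numpy as np

def truncated_svd(W, r):
    """Return (A, B, s): A @ B is the best rank-r approximation of W,
    and s holds all singular values of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]      # fold the kept singular values into the left factor
    B = Vt[:r, :]
    return A, B, s

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))     # toy "weight matrix" (hypothetical sizes)
r = 16
A, B, s = truncated_svd(W, r)

err = np.linalg.norm(W - A @ B, "fro") ** 2   # squared Frobenius error
tail = np.sum(s[r:] ** 2)                     # sum of discarded sigma_i^2
ratio = r * (W.shape[0] + W.shape[1]) / W.size  # retained parameters r(m+n)/(mn)
```

Here `err` and `tail` agree up to floating-point error, and `ratio` is the fraction of parameters kept by storing the two factors instead of the dense matrix.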
Why Vanilla SVD Falls Short for LLMs
The plain truncated SVD minimizes weight reconstruction error $\|W - \hat{W}\|_F^2$. But in a neural network, weight perturbations matter only through their effect on the model’s output. Two perturbations with the same Frobenius norm but different alignment with the network’s active subspaces can have very different effects on loss.
Intuition: If the network rarely activates certain directions in weight space (because the input distribution never excites them), large singular values in those directions are essentially noise for inference. Vanilla SVD keeps them anyway because they have large $\sigma_i$, while discarding small singular values that might be in highly “active” directions.
What is Whitening?
Whitening is a linear transformation that makes a distribution isotropic (unit-covariance, zero-mean). If input activations $x$ have covariance $R = \mathbb{E}[x x^\top]$, whitening maps $x \mapsto R^{-1/2} x$ so the transformed vector has identity covariance. After whitening, Euclidean distance is a better proxy for how much the network “cares” about a difference in that direction.
In SVD compression, the idea is: instead of minimizing $\|W - \hat{W}\|_F^2$ in the raw weight space, minimize the activation reconstruction error:
$$\mathbb{E}_x \|(W - \hat{W})\,x\|_2^2 = \|(W - \hat{W})\,R^{1/2}\|_F^2$$
This is equivalent to doing SVD on the whitened matrix $W R^{1/2}$, then unwhitening the result.
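The equivalence can be checked numerically. The sketch below is a toy under stated assumptions: synthetic anisotropic inputs, `R` taken as the empirical second moment of the inputs, a symmetric square root in place of a Cholesky factor, and hypothetical helper names `sym_sqrt` and `rank_r`.

```python
import numpy as np

def sym_sqrt(M):
    """Symmetric square root and inverse square root of an SPD matrix."""
    w, Q = np.linalg.eigh(M)
    return (Q * np.sqrt(w)) @ Q.T, (Q / np.sqrt(w)) @ Q.T

def rank_r(M, r):
    """Best rank-r approximation of M via truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(0)
m, n, r, T = 32, 24, 6, 4000
W = rng.standard_normal((m, n))
scales = np.logspace(0, -2, n)                 # a few input directions dominate
X = rng.standard_normal((T, n)) * scales       # rows are activations x_t
R = X.T @ X / T                                # empirical E[x x^T]
R_half, R_inv_half = sym_sqrt(R)

W_plain = rank_r(W, r)                         # vanilla truncated SVD
W_white = rank_r(W @ R_half, r) @ R_inv_half   # whiten, truncate, unwhiten

def act_err(W_hat):
    """Mean squared activation reconstruction error over the sample."""
    return np.mean(np.sum(((W - W_hat) @ X.T) ** 2, axis=0))

err_plain, err_white = act_err(W_plain), act_err(W_white)
frob_plain = np.linalg.norm((W - W_plain) @ R_half, "fro") ** 2
```

`err_plain` matches `frob_plain` (the identity between activation error and the whitened Frobenius norm), and `err_white` is never larger than `err_plain`, since the whitened truncation is optimal for this objective among rank-`r` matrices.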
KL Divergence and Second-Order Taylor Expansion
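The KL divergence measures how much a distribution $q$ differs from $p$:
$$\mathrm{KL}(p \,\|\, q) = \sum_v p(v) \log \frac{p(v)}{q(v)}$$
For LLMs, the natural objective when compressing is to keep the compressed model’s predictive distribution $\hat{p}$ as close as possible to the uncompressed model’s $p$.
A second-order Taylor expansion of $\mathrm{KL}\big(p(z) \,\|\, p(z + \delta z)\big)$ in the logits $z$ around $\delta z = 0$ gives:
$$\mathrm{KL}\big(p(z) \,\|\, p(z + \delta z)\big) \approx \tfrac{1}{2}\, \delta z^\top H\, \delta z$$
where $H = \mathrm{Diag}(p) - p p^\top$ is the Hessian of the softmax cross-entropy (the softmax Jacobian). This quadratic form tells us which directions in logit space are most sensitive to perturbation.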
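A quick numerical check of this expansion, with the Hessian written as Diag(p) − p pᵀ for p = softmax(z). The vocabulary size and perturbation scale below are arbitrary toy choices, so treat this as a sanity check rather than anything taken from the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """KL(p || q) for two strictly positive probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
V = 50                                   # toy vocabulary size
z = rng.standard_normal(V)               # "uncompressed" logits
p = softmax(z)
H = np.diag(p) - np.outer(p, p)          # Hessian of the KL w.r.t. logits at dz = 0

dz = 1e-2 * rng.standard_normal(V)       # small logit perturbation
exact = kl(p, softmax(z + dz))
quad = 0.5 * dz @ H @ dz                 # second-order prediction

shift_invariant = np.allclose(H @ np.ones(V), 0.0)   # constant shifts cost nothing
```

For small perturbations `exact` and `quad` agree to within a few percent, and `H` annihilates the all-ones vector, which is the softmax shift invariance in matrix form.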
Fisher Information and Kronecker Factoring
For a parametric model with parameters $\theta$, the Fisher information matrix measures the curvature of the log-likelihood. For matrix-shaped parameters (like transformer weight matrices), it can often be approximated by a Kronecker product of two smaller matrices: one capturing input statistics, one capturing output (gradient) statistics. This is the key insight behind methods like K-FAC (Kronecker-Factored Approximate Curvature).
IO-SVD derives its whitening matrices from a Kronecker-factored approximation to the Hessian of the KL loss, making it information-geometry-aware.
Transformer Linear Layers and Low-Rank Compression
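In a transformer, most parameters live in linear layers: query/key/value/output projections in attention, and up/gate/down projections in MLP blocks. For a model with $L$ layers and hidden dimension $d$:
- Attention projections: $W_Q, W_K, W_V, W_O \in \mathbb{R}^{d \times d}$
- MLP up/down projections: $W_{\text{up}} \in \mathbb{R}^{d_{\text{ff}} \times d}$, $W_{\text{down}} \in \mathbb{R}^{d \times d_{\text{ff}}}$
Low-rank compression replaces each $W \in \mathbb{R}^{m \times n}$ with $AB$, reducing per-layer parameter count from $mn$ to $r(m+n)$. The storage break-even rank is $r^* = mn/(m+n)$: above this rank, storing two factors is actually more expensive than the dense matrix.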
The IO-SVD Method
IO-SVD has three interlocking components. Let me walk through each in detail.
Component 1: KL-Aware Double-Sided Whitening
The Problem with One-Sided Whitening
Existing methods like SVD-LLM whiten the weight matrix on the input side, multiplying it by the Cholesky factor of the input activation covariance:
$$\tilde{W}_\ell = W_\ell S_\ell, \qquad S_\ell S_\ell^\top = R_\ell = \mathbb{E}[x x^\top]$$
They then do SVD on $\tilde{W}_\ell$, drop small singular values, and undo the whitening. This captures input activation geometry but ignores how errors in the layer’s output affect the downstream loss.
The Alternative that Fails: You could try to use output gradients directly, but the output Jacobian for a token (mapping from layer output $h_{\ell,t}$ to logits $z_t$) has dimension $V \times d_{\text{out}}$ where $V$ is the vocabulary size, which is too large to form explicitly.
IO-SVD’s Solution: Derive the Output Metric from KL
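For a target layer $\ell$ with weight $W_\ell$ and compression $\hat{W}_\ell$, the perturbation is $\Delta W_\ell = \hat{W}_\ell - W_\ell$. This changes layer output by $\delta h_{\ell,t} = \Delta W_\ell\, x_{\ell,t}$, which propagates to logit perturbation $\delta z_t$.
Step 1: Define the global objective (Eq. 1 in the paper):
$$\min_{\hat{W}_\ell:\ \mathrm{rank}(\hat{W}_\ell) \le r}\ \mathbb{E}_t\big[\mathrm{KL}(p_t \,\|\, \hat{p}_t)\big]$$
We want $\hat{W}_\ell$ that minimizes this token-level KL divergence between the original model and compressed model.
Step 2: Taylor-expand the per-token KL around the uncompressed logits (Appendix B.1):
Since $\mathrm{KL}(p_t \,\|\, p_t) = 0$ and $\nabla_{\delta z}\,\mathrm{KL}\big|_{\delta z = 0} = 0$, the zeroth- and first-order terms vanish. The second-order expansion gives:
$$\mathrm{KL}(p_t \,\|\, \hat{p}_t) \approx \tfrac{1}{2}\, \delta z_t^\top H_t\, \delta z_t$$
where
$$H_t = \mathrm{Diag}(p_t) - p_t p_t^\top$$
This matrix is symmetric positive semi-definite (it is the Hessian of the KL loss w.r.t. logits). Note $H_t \mathbf{1} = 0$, reflecting softmax invariance to constant logit shifts.
Step 3: Propagate layer perturbation to logit change using the linearization $\delta z_t \approx J_{\ell,t}\, \Delta W_\ell\, x_{\ell,t}$, where $J_{\ell,t} = \partial z_t / \partial h_{\ell,t}$ is the Jacobian of the final logits w.r.t. layer $\ell$’s output:
$$\mathrm{KL}(p_t \,\|\, \hat{p}_t) \approx \tfrac{1}{2}\, x_{\ell,t}^\top \Delta W_\ell^\top J_{\ell,t}^\top H_t J_{\ell,t}\, \Delta W_\ell\, x_{\ell,t}$$
Step 4: Average over calibration tokens (Eq. 4). Using the trace identity $a^\top M a = \mathrm{tr}(M a a^\top)$ and averaging:
$$\mathbb{E}_t\big[\mathrm{KL}(p_t \,\|\, \hat{p}_t)\big] \approx \tfrac{1}{2}\, \mathrm{tr}\big(C_\ell\, \Delta W_\ell\, R_\ell\, \Delta W_\ell^\top\big) = \tfrac{1}{2}\, \big\|C_\ell^{1/2}\, \Delta W_\ell\, R_\ell^{1/2}\big\|_F^2$$
where:
- $R_\ell = \mathbb{E}_t[x_{\ell,t} x_{\ell,t}^\top]$: input activation covariance (captures input geometry)
- $C_\ell = \mathbb{E}_t[J_{\ell,t}^\top H_t J_{\ell,t}]$: output sensitivity matrix (captures how the predictive distribution changes)
The moment-decoupling approximation used to go from the per-token product to this form assumes that the inputs $x_{\ell,t}$ and the output curvature $J_{\ell,t}^\top H_t J_{\ell,t}$ are approximately independent, a standard Kronecker factorization assumption.
Why this formulation is better: The objective is equivalent to a Frobenius-norm problem on the doubly-whitened matrix $B_\ell = C_\ell^{1/2} W_\ell R_\ell^{1/2}$. By Eckart–Young–Mirsky, the optimal low-rank approximation of $W_\ell$ under this objective is found by:
- Form $B_\ell = C_\ell^{1/2} W_\ell R_\ell^{1/2}$ (doubly whiten, Eq. 5)
- Compute rank-$r$ truncated SVD: $B_\ell \approx U_r \Sigma_r V_r^\top$ (Eq. 7)
- Undo whitening: $\hat{W}_\ell = C_\ell^{-1/2} U_r \Sigma_r V_r^\top R_\ell^{-1/2}$ (Eq. 8)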
Algorithm: IO-SVD Layer Compression
Input: W_ℓ ∈ ℝ^{m×n}, R_ℓ (input cov.), C_ℓ (output sensitivity), rank r
Output: Ŵ_ℓ ≈ W_ℓ (low-rank)
Step 1: Compute R_ℓ^{1/2}, C_ℓ^{1/2} via eigendecomposition
(use damped estimates: R̄_ℓ = R_ℓ + λ_R I to ensure invertibility)
Step 2: Form doubly-whitened matrix
B_ℓ = C_ℓ^{1/2} W_ℓ R_ℓ^{1/2}
Step 3: Compute truncated SVD of B_ℓ
U_r, Σ_r, V_r = top-r SVD(B_ℓ)
Step 4: Unwhiten to recover compressed weight
Ŵ_ℓ = C_ℓ^{-1/2} U_r Σ_r V_r^T R_ℓ^{-1/2}
Design choice: why damping? Without damping, $R_\ell$ or $C_\ell$ may be nearly singular (some directions in activation space are never excited by the calibration data). Adding $\lambda I$ ensures we can take square roots and inverses stably. The damping constants are hyperparameters chosen small enough not to distort the geometry but large enough to prevent numerical issues.
What would happen without output whitening? Methods like SVD-LLM use only $R_\ell$ (one-sided). Two whitened singular components with the same $\sigma_i$ in $W_\ell R_\ell^{1/2}$ might have very different downstream effects depending on whether they point in directions with high output sensitivity (large eigenvalues of $C_\ell$). Without $C_\ell$, we treat all output directions equally, which misses that difference.
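The four algorithm steps above can be sketched in NumPy as follows. This is a toy under stated assumptions: `R` and `C` are random SPD matrices rather than calibrated statistics, the damping values are arbitrary, and the function names (`io_svd_layer`, `weighted_err`, `random_spd`) are mine, not the authors'.

```python
import numpy as np

def sym_sqrt_pair(M, damp):
    """Return (M_d^{1/2}, M_d^{-1/2}) for the damped matrix M_d = M + damp * I."""
    w, Q = np.linalg.eigh(M + damp * np.eye(M.shape[0]))
    w = np.clip(w, 1e-12, None)
    return (Q * np.sqrt(w)) @ Q.T, (Q / np.sqrt(w)) @ Q.T

def rank_r(M, r):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def io_svd_layer(W, R, C, r, damp_R=1e-6, damp_C=1e-6):
    """Doubly whiten, truncate to rank r, then unwhiten."""
    R_h, R_ih = sym_sqrt_pair(R, damp_R)
    C_h, C_ih = sym_sqrt_pair(C, damp_C)
    return C_ih @ rank_r(C_h @ W @ R_h, r) @ R_ih

def weighted_err(W, W_hat, R, C):
    """tr(C dW R dW^T), the quadratic objective (without the factor 1/2)."""
    D = W - W_hat
    return float(np.trace(C @ D @ R @ D.T))

def random_spd(d, rng):
    """Toy SPD matrix with eigenvalues spread between 1 and 0.01."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return (Q * np.logspace(0, -2, d)) @ Q.T

rng = np.random.default_rng(0)
m, n, r = 32, 24, 6
W, R, C = rng.standard_normal((m, n)), random_spd(n, rng), random_spd(m, rng)

W_io = io_svd_layer(W, R, C, r)
R_h, R_ih = sym_sqrt_pair(R, 1e-6)
W_in = rank_r(W @ R_h, r) @ R_ih          # input-only whitening
W_plain = rank_r(W, r)                    # vanilla SVD

e_io, e_in, e_plain = (weighted_err(W, X, R, C) for X in (W_io, W_in, W_plain))
```

Under this objective `e_io` is the smallest of the three up to the tiny effect of damping, which is the Eckart–Young argument applied to the doubly-whitened matrix. It says nothing about end-task quality by itself; that depends on how well `R` and `C` describe the real model.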
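Efficiently computing $C_\ell$
The naive computation of $C_\ell$ requires materializing $H_t \in \mathbb{R}^{V \times V}$ and $J_{\ell,t} \in \mathbb{R}^{V \times d_{\text{out}}}$, both of which are huge (vocabulary size $V$, e.g. 32,000 for LLaMA).
IO-SVD sidesteps this with top-K approximation: restrict to the top-$K$ tokens in the uncompressed model’s output distribution, renormalize on this support, and accumulate using vector-Jacobian products (VJPs) via backward hooks. Let:
$$p_{t,K} = \text{renormalized top-}K\text{ probabilities}, \qquad s_t = \sqrt{p_{t,K}}\ \text{(elementwise)}$$
Then $H_{t,K} = A_t A_t^\top$ where $A_t = \mathrm{Diag}(s_t)\, P_t$, and $P_t = I - s_t s_t^\top$ is an orthogonal projector onto the subspace perpendicular to $s_t$. This factorization lets us accumulate $C_\ell$ by running $K$ backward passes per token, one per unit vector $e_k$, each of cost $O(K \cdot d_{\text{out}} \cdot \text{depth})$, which is much cheaper than explicit $V$-dimensional Jacobians.
The ablation over $K$ (Figure 3 in the paper) shows a “sweet spot” where more top tokens stop helping and may hurt. The optimal $K$ on WikiText2 generalizes to PTB and C4, suggesting robustness.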
Component 2: Adaptive Heterogeneous Rank Allocation
The Problem with Fixed Rank Ratios
Existing methods often allocate rank proportionally to layer size (e.g., compress each layer to 80% of its parameters). But different layers have dramatically different sensitivities to compression: early layers, attention heads, or certain MLP blocks may be far more critical to downstream quality than others. Forcing a uniform compression ratio wastes capacity in easy-to-compress layers and damages critical ones.
Alternative approach: gradient-based optimization (as in Dobi-SVD) can find optimal per-layer ranks but requires running backpropagation through the compressed model at each search step — computationally expensive.
IO-SVD’s Solution: Greedy Score-Based Allocation
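Given the whitened representation $B_\ell = U_\ell \Sigma_\ell V_\ell^\top$, how much does dropping the $i$-th singular component hurt the loss?
First-order score derivation (Eq. 10 in the paper): The calibration loss $\mathcal{L}_{\text{cal}}$ depends on $W_\ell = C_\ell^{-1/2} B_\ell R_\ell^{-1/2}$. With $G_\ell = \partial \mathcal{L}_{\text{cal}} / \partial W_\ell$, the corresponding whitened gradient is:
$$\tilde{G}_\ell = \frac{\partial \mathcal{L}_{\text{cal}}}{\partial B_\ell} = C_\ell^{-1/2}\, G_\ell\, R_\ell^{-1/2}$$
The derivative of $\mathcal{L}_{\text{cal}}$ w.r.t. the $i$-th singular value of $B_\ell$ is:
$$g_{\ell,i} = \frac{\partial \mathcal{L}_{\text{cal}}}{\partial \sigma_{\ell,i}} = u_{\ell,i}^\top\, \tilde{G}_\ell\, v_{\ell,i}$$
If we drop this component, the change in $\sigma_{\ell,i}$ is $-\sigma_{\ell,i}$, so the predicted first-order loss change is:
$$\Delta \mathcal{L}_{\ell,i} \approx -\,g_{\ell,i}\, \sigma_{\ell,i}$$
The importance score is therefore:
$$I_{\ell,i} = |g_{\ell,i}\, \sigma_{\ell,i}|$$
This is a product of gradient magnitude and singular value magnitude: a component is considered unimportant if either the singular value is small (it contributes little to the weight itself) or the gradient is small (the loss is insensitive to changes in that direction).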
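A small check of the score, written as |g·σ| where g is the projection of the whitened gradient onto one singular direction (uᵀ G̃ v). To keep the first-order prediction exactly verifiable, the sketch assumes a toy loss that is linear in the whitened matrix; the real calibration loss is not linear, so there the score is only a local estimate. The name `component_scores` is hypothetical.

```python
import numpy as np

def component_scores(B, G_tilde):
    """Scores I_i = |g_i * sigma_i| with g_i = u_i^T G_tilde v_i."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    g = np.einsum("mi,mn,in->i", U, G_tilde, Vt)   # g_i = u_i^T G_tilde v_i
    return np.abs(g * s), g, (U, s, Vt)

rng = np.random.default_rng(0)
B = rng.standard_normal((20, 12))          # stand-in for the doubly-whitened weight
G_tilde = rng.standard_normal((20, 12))    # stand-in for the whitened gradient
scores, g, (U, s, Vt) = component_scores(B, G_tilde)

def toy_loss(M):
    """Linear toy loss <G_tilde, M>; its gradient w.r.t. M is G_tilde."""
    return float(np.sum(G_tilde * M))

i = 7
B_drop = B - s[i] * np.outer(U[:, i], Vt[i, :])    # remove the i-th component
actual_change = toy_loss(B_drop) - toy_loss(B)
predicted_change = -g[i] * s[i]
```

For this linear loss `actual_change` equals `predicted_change`, and `scores[i]` is its absolute value. The sign is discarded because the score is meant to rank sensitivity, not predict whether the loss goes up or down.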
The greedy allocation algorithm (Algorithm 2):
Input: {W_ℓ} with whitened SVDs, global budget B_target, min-rank ratio η
Output: {r_ℓ} (per-layer ranks), {Ŵ_ℓ} (compressed weights)
1. For each layer ℓ:
a. Compute B_ℓ = C_ℓ^{1/2} W_ℓ R_ℓ^{1/2}; run SVD
b. Compute whitened gradient G̃_ℓ; score each component I_{ℓ,i} = |g_{ℓ,i} σ_{ℓ,i}|
c. Set r_ℓ = r_ℓ^max = min(m_ℓ, n_ℓ)
d. Compute breakeven rank: r_ℓ* = ⌊m_ℓ n_ℓ / (m_ℓ + n_ℓ)⌋
e. Set r_ℓ^min = ⌈η · r_ℓ*⌉ (minimum rank floor, e.g. η=0.1)
f. Push tail component (I_{ℓ,r_ℓ}, ℓ, r_ℓ, storage_gain) into shared min-heap Q
2. While budget b < B_target and Q not empty:
a. Pop (I_{ℓ,i}, ℓ, i, Δb) with SMALLEST score from Q [min-heap]
b. Drop this singular component; update b += Δb, r_ℓ -= 1
c. If r_ℓ > r_ℓ^min: push next tail candidate of layer ℓ back into Q
3. Reconstruct: for each layer ℓ:
If r_ℓ > r_ℓ*: keep dense W_ℓ (no compression benefit)
Else: Ŵ_ℓ = C_ℓ^{-1/2} U_{ℓ,1:r_ℓ} Σ_{ℓ,1:r_ℓ} V_{ℓ,1:r_ℓ}^T R_ℓ^{-1/2}
Why greedy works here: The first-order score gives an independent estimate of each component’s loss contribution. While greedy is not globally optimal (it ignores interaction effects), in practice these interactions are small when compressing one component at a time, and the first-order approximation is good in the low-compression regime where the compressed model is still close to the original.
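The storage gain formula (Eq. 14) is crucial for efficiency. For a weight $W_\ell \in \mathbb{R}^{m \times n}$, the dense representation has $mn$ parameters. A rank-$r$ factorization uses $r(m+n)$. The breakeven is $r^* = \lfloor mn/(m+n) \rfloor$.
The storage gain from dropping rank from $r$ to $r-1$ is:
$$\Delta b(r) = s(r) - s(r-1), \qquad s(r) = \begin{cases} mn & r > r^* \\ r(m+n) & r \le r^* \end{cases}$$
so $\Delta b(r) = 0$ while $r - 1 > r^*$ and $\Delta b(r) = m + n$ once $r \le r^*$, consistent with the reconstruction rule in Algorithm 2 (layers above breakeven stay dense).
This piecewise formula has an important implication: removing singular components from a layer that is still above its breakeven rank adds nothing to the saved budget $b$, because that layer would still be stored dense. Savings are realized only once a layer crosses below its breakeven rank.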
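The allocation loop and the storage rule can be sketched with Python's `heapq`. This is a simplified reading of the procedure, not the authors' code: scores are given as plain lists, the budget is counted in saved parameters, a layer above its break-even rank is charged as dense, and the names `storage` and `greedy_ranks` are hypothetical.

```python
import heapq
import math

def storage(m, n, r):
    """Stored parameters at rank r: two factors at or below break-even, dense above."""
    r_star = (m * n) // (m + n)
    return r * (m + n) if r <= r_star else m * n

def greedy_ranks(shapes, scores, target_saving, eta=0.1):
    """shapes[l] = (m, n); scores[l][i] = importance of singular component i (0-based).
    Repeatedly drops the globally least important tail component until the number
    of saved parameters reaches target_saving. Returns (ranks, saved)."""
    ranks, floors, heap = [], [], []
    for l, (m, n) in enumerate(shapes):
        r_max = min(m, n)
        r_star = (m * n) // (m + n)
        ranks.append(r_max)
        floors.append(max(1, math.ceil(eta * r_star)))    # minimum-rank floor
        heapq.heappush(heap, (scores[l][r_max - 1], l))    # one tail candidate per layer
    saved = 0
    while saved < target_saving and heap:
        _, l = heapq.heappop(heap)                         # smallest score first
        m, n = shapes[l]
        saved += storage(m, n, ranks[l]) - storage(m, n, ranks[l] - 1)
        ranks[l] -= 1
        if ranks[l] > floors[l]:
            heapq.heappush(heap, (scores[l][ranks[l] - 1], l))
    return ranks, saved

shapes = [(8, 8), (8, 8)]                     # break-even rank is 4 for both layers
scores = [
    [0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01],   # insensitive layer
    [80, 70, 60, 50, 40, 30, 20, 10],                   # sensitive layer
]
ranks, saved = greedy_ranks(shapes, scores, target_saving=32)
print(ranks, saved)   # → [2, 8] 32
```

In this toy run the insensitive layer is cut from rank 8 to rank 2, and the first four drops save nothing because the layer is still at or above break-even storage. The sensitive layer is never touched and would be kept dense.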
Visualization of the allocation process:
graph TD
A["Initialize: all layers at max rank<br>Score every tail component I_{ℓ,i} = |g_{ℓ,i}σ_{ℓ,i}|"]
A --> B["Shared min-heap Q with one tail candidate per layer"]
B --> C{"Budget b < B_target?"}
C -->|Yes| D["Pop layer ℓ* with min score (least important component)"]
D --> E["Drop tail component of ℓ*; r_{ℓ*} -= 1; b += storage_gain"]
E --> F{"r_{ℓ*} > r_min?"}
F -->|Yes| G["Push ℓ*'s new tail component back into Q"]
F -->|No| H["Layer ℓ* hits min-rank floor; no more from ℓ*"]
G --> C
H --> C
C -->|No| I["Stop: heterogeneous {r_ℓ} achieved"]
I --> J["Unwhiten: Ŵ_ℓ = C_ℓ^{-1/2} Û_ℓ Σ̂_ℓ V̂_ℓ^T R_ℓ^{-1/2}"]
Component 3: Loss-Aware Remapping for Hybrid Compression
The Storage-Quality Trade-off Problem
Pure low-rank compression has a hard limit: at rank $r$, you need $r(m+n)$ parameters. To reach very aggressive compression ratios (e.g., 40% retention), you must discard many singular values, hurting quality severely.
Dobi-SVD’s partial solution: Combine SVD truncation with quantization. After SVD, write $\hat{W}_\ell = A_\ell D_\ell^\top$ where $A_\ell, D_\ell$ are the low-rank factors. Selectively store some rows of $A_\ell$ or $D_\ell$ in 8-bit integer instead of 16-bit float. This lets you keep more singular components (higher rank) at the same bit budget, trading some quantization error for lower truncation error. But Dobi-SVD’s row selection is structural and fixed: it doesn’t consider which rows, if quantized, would actually hurt the loss.
IO-SVD’s improvement: loss-aware row selection
After SVD truncation, for each candidate row $r_{\ell,i}$ from factor $A_\ell$ or $D_\ell$:
- Simulate int8 quantization: compute $\Delta r_{\ell,i} = Q_8(r_{\ell,i}) - r_{\ell,i}$, the quantization error
- Score by first-order loss impact (Eq. in Section 3.3): using calibration gradients $\gamma_{\ell,i} = \partial \mathcal{L}_{\text{cal}} / \partial r_{\ell,i}$:
$$s_{\ell,i} = |\langle \gamma_{\ell,i},\, \Delta r_{\ell,i} \rangle|$$
This is the inner product of the loss gradient and the quantization error; it estimates how much quantizing row $r_{\ell,i}$ changes the calibration loss.
- Greedy selection: sort all candidate rows by score; greedily quantize rows with the smallest predicted loss impact until the remaining compression budget is met.
Algorithm: Loss-Aware Remapping
Input: compressed factors A_ℓ, D_ℓ; calibration gradients γ_{ℓ,i}; remaining budget C_rem
Output: hybrid A_ℓ, D_ℓ (some rows in int8)
1. For each row r_{ℓ,i} in A_ℓ ∪ D_ℓ:
Δr = Q8(r_{ℓ,i}) - r_{ℓ,i} # quantization error
s_{ℓ,i} = |⟨γ_{ℓ,i}, Δr⟩| # predicted loss impact
2. Sort rows by s_{ℓ,i} ascending (low score = safe to quantize)
3. While remaining budget not met:
Quantize the next row (smallest s_{ℓ,i}) to int8
Mark its storage as reduced from 2 bytes to 1 byte per element
Why does this help? Not all rows in the low-rank factors contribute equally to the output. Some rows span directions with large gradients (the loss changes a lot if you perturb them); others are in gradient-insensitive directions. By quantizing the gradient-insensitive rows first, you preserve quality while achieving the compression target.
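A minimal sketch of the row-selection idea, under several assumptions that are mine rather than the paper's: symmetric per-row int8 quantization, fp16 baseline storage (2 bytes per element), no accounting for the per-row scale, and random stand-ins for the factor rows and their calibration gradients.

```python
import numpy as np

def quantize_int8_rowwise(rows):
    """Symmetric per-row int8 fake-quantization; returns the dequantized rows."""
    scale = np.abs(rows).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(rows / scale), -127, 127)
    return q * scale

def select_rows_to_quantize(rows, grads, bytes_to_save):
    """Pick rows with the smallest |<grad, Q8(row) - row>| until enough bytes are saved.
    Assumes each quantized element drops from 2 bytes to 1 byte."""
    delta = quantize_int8_rowwise(rows) - rows          # simulated quantization error
    scores = np.abs(np.sum(grads * delta, axis=1))      # first-order loss impact
    order = np.argsort(scores)                          # ascending: safest rows first
    chosen, saved = [], 0
    for i in order:
        if saved >= bytes_to_save:
            break
        chosen.append(int(i))
        saved += rows.shape[1]                          # 1 byte saved per element
    return chosen, scores

rng = np.random.default_rng(0)
rows = rng.standard_normal((10, 16))     # stand-in for rows of a low-rank factor
grads = rng.standard_normal((10, 16))    # stand-in for calibration gradients
chosen, scores = select_rows_to_quantize(rows, grads, bytes_to_save=48)
rest = [i for i in range(len(rows)) if i not in chosen]
```

With 16 elements per row and 48 bytes to save, three rows are selected, and every selected row has a score no larger than any row left in fp16. A structural scheme would pick rows by position instead and ignore `grads` entirely.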
Boundary condition: At very aggressive compression (> 50% pruning), even loss-aware remapping may not recover full quality. The authors note that at 60% pruning they switch to HQ (Half-prune + Quantization): first SVD at twice the target rate, then quantize to 8-bit to reach the final budget. This two-stage approach keeps more singular information while using quantization to squeeze down.
System Architecture Overview
flowchart LR
subgraph Input
W["Dense Weight W_ℓ ∈ ℝ^{m×n}"]
D["Calibration data D_cal"]
end
subgraph Statistics["Step 1: Compute Statistics (online, via hooks)"]
R["R_ℓ = E[xx^T]\n(input activation covariance)"]
C_mat["C_ℓ = E[J^T H J]\n(KL output curvature,\ntop-K approximation)"]
G["G_ℓ = ∂L_cal/∂W_ℓ\n(gradient for scoring)"]
end
subgraph Whitening["Step 2: Double-Sided Whitening"]
B["B_ℓ = C_ℓ^{1/2} W_ℓ R_ℓ^{1/2}"]
SVD_B["SVD(B_ℓ) = U_ℓ Σ_ℓ V_ℓ^T"]
end
subgraph Allocation["Step 3: Rank Allocation (global)"]
Score["I_{ℓ,i} = |g_{ℓ,i} σ_{ℓ,i}|"]
Heap["Min-heap Q of tail components"]
Greedy["Greedy removal until budget met"]
end
subgraph Unwhiten["Step 4: Reconstruct"]
What["Ŵ_ℓ = C_ℓ^{-1/2} Û Σ̂ V̂^T R_ℓ^{-1/2}"]
end
subgraph Remap["Step 5: Loss-Aware Remapping (optional)"]
Factors["A_ℓ D_ℓ^T = Ŵ_ℓ"]
Quant["Score rows by |⟨γ,Δr⟩|; int8 lowest-impact rows"]
Hybrid["Hybrid A_ℓ (mixed fp16/int8)"]
end
W --> Statistics
D --> Statistics
Statistics --> Whitening
Whitening --> Allocation
Allocation --> Unwhiten
Unwhiten --> Remap
Experiments and Results
Setup
- Models: LLaMA-7B, LLaMA-13B, LLaMA-2-7B, OPT-6.7B, Vicuna-7B, LLaVA-1.5 7B/13B, SmolVLM 2B
- Calibration: 256 randomly sampled WikiText2 sequences, length 2048
- Compression targets: attention Q/K/V/O projections and MLP layers
- Baselines: ASVD, SVD-LLM, Dobi-SVD, ZS-SVD
- Evaluation: PPL on WikiText2/PTB/C4; zero-shot accuracy on OpenBookQA, ARC-Easy/Challenge, WinoGrande, HellaSwag, PIQA, MathQA
Figure: LLaMA-7B Perplexity vs. Compression Ratio
LLaMA-7B WikiText2 PPL by maintenance ratio (lower = better). Rows marked ∗ or ‡ include remapping.

| Method | 0.8 | 0.6 | 0.4 |
|---|---|---|---|
| Baseline (FP16) | 5.68 | 5.68 | 5.68 |
| ASVD | 11.14 | 1407 | 57057 |
| SVD-LLM | 7.94 | 13.11 | 53.74 |
| Dobi-SVD | 8.54 | 13.54 | 46.18 |
| ZS-SVD | 6.74 | 11.44 | 45.17 |
| IO-SVD | 6.41 | 9.84 | 27.70 |
| Dobi-SVD∗ (+ remapping) | 6.08 | 8.12 | 9.95 |
| ZS-SVD∗ (+ remapping) | 5.90 | 6.96 | 6.73 |
| IO-SVD‡ (+ remapping) | 5.59 | 6.27 | 6.41 |
At 80% retention: IO-SVD (6.41) beats SVD-LLM (7.94) by 1.53 PPL. At 60%: 9.84 vs. 13.11. At 40%: 27.70 vs. 53.74, nearly a 2× reduction in perplexity. With loss-aware remapping, IO-SVD‡ achieves 5.59 at 80% retention, which is 0.09 PPL below the 5.68 FP16 baseline listed in the table.
Key observation: ASVD completely collapses at 60% and 40% retention (PPL > 1000 at 60%), showing that diagonal activation scaling alone is not enough at these ratios. SVD-LLM also degrades severely at 40%. IO-SVD’s double-sided whitening and heterogeneous allocation degrade far more gracefully.
Ablation: What Contributes What?
Table 4 in the paper ablates the individual components:
| Whitening type | Het. rank | PPL (0.8) | PPL (0.6) | PPL (0.4) |
|---|---|---|---|---|
| Input-only (SVD-LLM) | No | 7.95 | 13.11 | 53.74 |
| Double-sided (OBD-LLM) | No | 7.36 | 11.34 | 32.95 |
| Double-sided (IO-SVD) | No | 7.31 | 11.20 | 32.09 |
| Input-only (SVD-LLM) | Yes | 6.72 | 11.65 | 62.76 |
| Double-sided (OBD-LLM) | Yes | 6.45 | 9.90 | 28.19 |
| Double-sided (IO-SVD) | Yes | 6.41 | 9.84 | 27.70 |
Findings:
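- Heterogeneous rank allocation alone gives a large gain at mild compression: 7.95 → 6.72 PPL (at 0.8 ratio) just from adaptive rank allocation with SVD-LLM whitening. With input-only whitening the gain shrinks at 0.6 (13.11 → 11.65) and reverses at 0.4 (53.74 → 62.76).
- Double-sided whitening improves every setting, and the gain grows with compression: about 0.3 to 0.6 PPL at 0.8, about 1.8 to 1.9 at 0.6, and more than 20 at 0.4
- KL vs. Kronecker curvature for the output metric (IO-SVD vs. OBD-LLM): small additional gain, most visible at aggressive compression
This decomposition shows that heterogeneous rank allocation is the larger contribution at 0.8 retention, while double-sided whitening dominates at 0.4, where rank allocation only helps once the output metric is in place.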
VLM Compression Results
For visual-language models (LLaVA-1.5 7B, 13B and SmolVLM 2B), IO-SVD is evaluated on ScienceQA-IMG and SEED-Bench. It applies compression only to Q/K/V attention projections (consistent with VLM-specific methods QSVD and WSVD).
At 70% retention on LLaVA-1.5 7B:
- ASVD: 50.12
- SVD-LLM: 63.71
- IO-SVD: 68.07 (best)
For SmolVLM 2B at 80% retention:
- ASVD: 3.82%, SVD-LLM: 17.20%, IO-SVD: 82.65% on ScienceQA-IMG
The benefit on SmolVLM is dramatic, plausibly because smaller models have less redundancy, which would make the choice of compression metric even more critical.
Cross-Architecture Generalization
Table 3 evaluates OPT-6.7B, Vicuna-7B, and LLaMA-13B at 20% pruning:
| Model | Baseline PPL | ZS-SVD PPL | IO-SVD PPL |
|---|---|---|---|
| OPT-6.7B | 10.86 | 11.40 | 11.10 |
| Vicuna-7B | 6.78 | 8.08 | 7.36 |
| LLaMA-13B | 5.09 | 5.84 | 5.60 |
IO-SVD beats ZS-SVD on all three models, including the non-LLaMA OPT-6.7B, which suggests the method is not overfitted to the LLaMA family.
Inference Speed and Memory
Experiments on a single NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB VRAM), compressing LLaMA-2-7B at batch 64, sequence length 1024+1024:
| Configuration | Throughput | Speedup vs. dense | Peak GPU memory |
|---|---|---|---|
| Dense baseline | 470 tok/s | 1.00× | 77.6 GB |
| IO-SVD (no cache opt) | 483 tok/s | 1.03× | 70.4 GB |
| IO-SVD + V-cache compression | 1392 tok/s | 2.96× | 50.3 GB |
| IO-SVD + KV-cache compression | 2043 tok/s | 4.34× | 23.1 GB |
KV cache compression insight: When key and value projections are compressed to rank $r$, the compressed projection is $\hat{W} = A D^\top$. Instead of caching the full output $\hat{W} x_t \in \mathbb{R}^{d}$, cache only the low-dimensional latent $z_t = D^\top x_t \in \mathbb{R}^{r}$ and reconstruct $A z_t$ on the fly. This reduces KV cache size from $O(T d)$ to $O(T r)$ per layer, a massive saving when $r \ll d$.
The weight memory drops from 12.6 GB (dense) to 5.4 GB (compressed), but the dominant saving is in KV cache: 64.0 GB → 11.6 GB with K and V compression, enabling 4.34× throughput on memory-bandwidth-limited decode.
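The latent-cache idea is easy to verify on a toy projection. In the sketch below the factor names `A` and `D`, the dimensions, and the single-head layout are illustrative assumptions; positional encodings and multi-head reshaping are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, r, T = 64, 64, 8, 100
A = rng.standard_normal((d_head, r))      # output-side factor
D = rng.standard_normal((d_model, r))     # input-side factor
W_v = A @ D.T                             # rank-r value projection
X = rng.standard_normal((T, d_model))     # rows are token hidden states x_t

full_cache = X @ W_v.T                    # T x d_head: what a dense cache stores
latent_cache = X @ D                      # T x r: latents z_t = D^T x_t
reconstructed = latent_cache @ A.T        # A z_t, rebuilt when attention needs it

cache_ratio = latent_cache.size / full_cache.size   # equals r / d_head
```

`reconstructed` matches `full_cache` up to floating-point error while the stored cache shrinks by the factor `r / d_head` (0.125 here). The price is an extra small matrix multiply at read time.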
flowchart LR
subgraph Dense["Dense Baseline"]
KVD["KV Cache: 64.0 GB\n(full d_head × T × L)"]
WD["Weights: 12.6 GB"]
TD["Total: 77.6 GB | 470 tok/s"]
end
subgraph CompW["IO-SVD Weights Only"]
WC["Weights: 5.4 GB\n(low-rank factors AB^T)"]
KVC0["KV Cache: 64.0 GB\n(unchanged)"]
TC["Total: 70.4 GB | 483 tok/s (1.03×)"]
end
subgraph CompV["+ V-Cache Compression"]
WC2["Weights: 5.4 GB"]
KVC1["V-Cache: 19.0 GB (latent z_t = D^T x_t)\nK-Cache: 19.8 GB"]
TC2["Total: 50.3 GB | 1392 tok/s (2.96×)"]
end
subgraph CompKV["+ KV-Cache Compression"]
WC3["Weights: 5.4 GB"]
KVC2["KV-Cache: 11.6 GB\n(both K and V latents)"]
TC3["Total: 23.1 GB | 2043 tok/s (4.34×)"]
end
Limitations and Boundary Conditions
IO-SVD has three main limitations acknowledged by the authors:
- Top-K truncation for curvature: Restricting to the top-$K$ tokens may miss sensitivity from the long tail of the vocabulary. For tasks requiring rare token prediction (code generation, specialized domains), this approximation could be worse.
- Greedy rank allocation: The greedy algorithm is $O(N \log L)$ where $N$ is total parameters and $L$ is number of layers, but it cannot backtrack. A globally suboptimal allocation might result if component sensitivities interact (e.g., dropping component $i$ changes the importance of component $j$). Second-order corrections could help but would require recomputing scores after each drop.
- Scale: Evaluated up to 13B parameters. For 70B+ models, the calibration-time cost of accumulating $C_\ell$ (backward passes) and $R_\ell$ (forward passes) over 256 sequences may become non-trivial, and the greedy allocation heap may hold a large number of candidates.
When does IO-SVD underperform? At very low compression (< 10% pruning), one would expect all SVD methods to converge, since any choice of whitening preserves most of the spectrum. The tables reproduced here start at 20% pruning, where double-sided whitening already helps (7.95 → 7.31 without adaptive ranks) and the gain grows with compression. Below that range, simpler one-sided methods are likely adequate.
What about LoRA fine-tuning after compression? The authors mention LoRA residual recovery (à la SVD-LLM) as a future direction. IO-SVD without fine-tuning already surpasses SVD-LLM with LoRA recovery at aggressive ratios, suggesting the better initialization (from double-sided whitening) provides a stronger starting point.
Comparison with Prior SVD Methods
graph LR
subgraph Prior["Prior Art Landscape"]
FWSVD["FWSVD (2022)\nFisher-weighted SVD\nWeight-space recon."]
ASVD["ASVD (2023)\nActivation-aware\n(diagonal input scaling)"]
SVDLLM["SVD-LLM (ICLR 2025)\nCholesky whitening\n(one-sided, input covariance)\n+ LoRA recovery"]
SVDLLM2["SVD-LLM v2 (NAACL 2025)\nHeterogeneous ranks\n(truncation loss estimate)"]
DOBI["Dobi-SVD (ICLR 2025)\nGradient optimization\n+ quantization remapping\n(structural row selection)"]
ZS["ZS-SVD (2026)\nOne-sided, minimize\nactivation recon while\nkeeping Δloss ≈ 0"]
OBD["OBD-LLM (2026)\nKronecker-factored\ndouble-sided whitening"]
end
subgraph IOSVD["IO-SVD (2026)"]
KL["KL-aware double-sided\nwhitening (novel output metric)"]
HRA["Greedy heterogeneous\nrank allocation"]
LAR["Loss-aware row\nquantization remapping"]
end
FWSVD --> ASVD --> SVDLLM --> SVDLLM2 --> IOSVD
DOBI --> IOSVD
OBD --> IOSVD
| Method | Whitening | Rank Allocation | Remapping |
|---|---|---|---|
| FWSVD | Weight-space (Fisher) | Homogeneous | No |
| ASVD | One-sided (diagonal) | Homogeneous | No |
| SVD-LLM | One-sided (Cholesky) | Homogeneous | No |
| SVD-LLM v2 | One-sided (Cholesky) | Heterogeneous | No |
| Dobi-SVD | None (gradient opt.) | Gradient-based | Structural |
| ZS-SVD | One-sided | Loss-constrained | Structural |
| OBD-LLM | Kronecker (two-sided) | Homogeneous | No |
| IO-SVD | KL-aware two-sided | Greedy (scored) | Loss-aware |
Deep Dive: The Math Behind Efficient Curvature Computation
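The most technically demanding part of IO-SVD is accumulating $C_\ell$ without materializing any $V$-dimensional objects. Let me trace through the derivation in Appendix C step by step.
Setting Up the Problem
For a target layer $\ell$, let $h_{\ell,t}$ be its output at token $t$. The top-$K$ restricted curvature is:
$$C_\ell = \mathbb{E}_t\big[J_{\ell,t}^\top H_{t,K}\, J_{\ell,t}\big], \qquad J_{\ell,t} = \frac{\partial z_{t,K}}{\partial h_{\ell,t}} \in \mathbb{R}^{K \times d_{\text{out}}}, \qquad H_{t,K} = \mathrm{Diag}(p_{t,K}) - p_{t,K}\, p_{t,K}^\top$$
where $z_{t,K}$ are the logits restricted to the top-$K$ support and $p_{t,K}$ are the probabilities renormalized on that support.
Factoring $H_{t,K}$
Define:
$$s_t = \sqrt{p_{t,K}}\ \text{(elementwise)}, \qquad P_t = I - s_t s_t^\top$$
Note that since $p_{t,K}$ is renormalized, $\|s_t\|_2^2 = \sum_k p_{t,k} = 1$, so $P_t$ is a projector: $P_t^2 = P_t$.
Define $A_t = \mathrm{Diag}(s_t)\, P_t$. Then:
$$A_t A_t^\top = \mathrm{Diag}(s_t)\, P_t\, \mathrm{Diag}(s_t) = \mathrm{Diag}(p_{t,K}) - p_{t,K}\, p_{t,K}^\top = H_{t,K}$$
So $J_{\ell,t}^\top H_{t,K} J_{\ell,t} = F_t^\top F_t$ with $F_t = A_t^\top J_{\ell,t}$.
Converting to VJP Accumulation
$$C_\ell = \frac{1}{T} \sum_t F_t^\top F_t = \frac{1}{T} \sum_t \sum_{k=1}^{K} f_{t,k}\, f_{t,k}^\top$$
where $f_{t,k}^\top$ is the $k$-th row of $F_t \in \mathbb{R}^{K \times d_{\text{out}}}$.
Each row $f_{t,k}^\top = (A_t e_k)^\top J_{\ell,t}$ is a vector-Jacobian product (VJP): the gradient of the scalar $(A_t e_k)^\top z_{t,K}$ with respect to $h_{\ell,t}$. This can be computed via one backward pass with the vector $v_k = A_t e_k$.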
Algorithm for accumulating :
For each calibration token t:
1. Run forward pass; record h_{ℓ,t} and top-K support + probabilities
2. Compute s_t = sqrt(p_{t,K}), A_t = Diag(s_t)(I - s_t s_t^T)
3. For k = 1, ..., K:
v_k = A_t e_k # k-th column of A_t (A_t is not symmetric, and H = A_t A_t^T); K-dim vector
f_k = VJP(z_{t,K}, h_{ℓ,t}, v_k) # backward hook, O(K · d_out · depth)
4. Accumulate: C_ℓ += F_t^T F_t = Σ_k f_k f_k^T
C_ℓ /= num_tokens # normalize
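Total cost per layer: $K \cdot T$ backward VJPs, where $T$ is the number of calibration tokens, each costing $O(K \cdot d_{\text{out}} \cdot \text{depth})$ with depth the model depth above the layer. For the settings used in this estimate, that comes to about 2.1 billion scalar operations, comparable to a single training step.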
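The accumulation can be sanity-checked on a toy where the Jacobian from layer output to top-K logits is an explicit matrix, so each VJP is just a matrix-vector product. That explicit Jacobian is an assumption made only so the result can be compared against the direct formula; a real implementation would obtain each product from autograd. Note the sketch probes with the k-th column of A = Diag(s)(I − s sᵀ), which is what makes the sum equal Jᵀ H J.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def accumulate_C(jacobians, logits_topk):
    """Accumulate C = mean_t J_t^T H_t J_t using K vector-Jacobian products per token.
    jacobians[t] is an explicit (K, d) matrix here only so the toy is checkable."""
    d = jacobians[0].shape[1]
    C = np.zeros((d, d))
    for J, z in zip(jacobians, logits_topk):
        p = softmax(z)                                       # renormalized on top-K
        s = np.sqrt(p)
        A = np.diag(s) @ (np.eye(len(s)) - np.outer(s, s))   # H = A A^T
        for k in range(len(s)):
            v_k = A[:, k]                   # k-th column of A, i.e. A e_k
            f_k = J.T @ v_k                 # stand-in for one backward pass (VJP)
            C += np.outer(f_k, f_k)
    return C / len(jacobians)

rng = np.random.default_rng(0)
T, K, d = 5, 8, 6
Js = [rng.standard_normal((K, d)) for _ in range(T)]   # toy per-token Jacobians
zs = [rng.standard_normal(K) for _ in range(T)]        # toy top-K logits

C_vjp = accumulate_C(Js, zs)
C_ref = sum(
    J.T @ (np.diag(softmax(z)) - np.outer(softmax(z), softmax(z))) @ J
    for J, z in zip(Js, zs)
) / T
```

`C_vjp` matches `C_ref`, and the result is symmetric positive semi-definite. Probing with rows of `A` instead of columns would not reproduce `C_ref` in general, because `A` is not symmetric.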
Why Not Full Gradient?
One might ask: why not just use the full gradient $G_\ell = \partial \mathcal{L}_{\text{cal}} / \partial W_\ell$ as the curvature? The answer is that $G_\ell$ is a first-order object (gradient), while $C_\ell$ is second-order (Hessian-like). The gradient tells you which direction to move; the curvature tells you how much the loss changes when you move in each direction. For compression, we need to know which directions are “expensive” to lose, and that is curvature information, not gradient direction.
Connection to Optimal Brain Damage / GPTQ
IO-SVD’s per-component scoring is closely related to the Optimal Brain Damage (OBD) framework:
OBD (LeCun et al., 1990) scores parameters by their second-order saliency:
$$s_i = \tfrac{1}{2}\, h_{ii}\, \theta_i^2$$
where $h_{ii}$ is the diagonal Hessian entry. The idea: parameters with small $h_{ii}$ and small magnitude are safe to prune.
IO-SVD’s score is a first-order approximation:
$$I_{\ell,i} = |g_{\ell,i}\, \sigma_{\ell,i}|$$
The magnitude ensures we score the absolute loss impact (we don’t know if the actual change will increase or decrease loss due to the sign of $g_{\ell,i}$, but the magnitude tells us the sensitivity scale).
Why not second-order? Second-order scoring (like GPTQ’s OBC framework) would also include the Hessian diagonal in the score. This would capture curvature information about how the loss curves near the current parameter value, but at the cost of computing diagonal Hessian elements — which requires running a second backward pass per component. For SVD components (which are already in the whitened space where the Hessian has a simpler structure), the first-order approximation with the doubly-whitened gradient turns out to be sufficient in practice.
How IO-SVD Fits into the Post-Training Compression Landscape
graph TD
subgraph PTQ["Post-Training Quantization (PTQ)"]
Q1["GPTQ/OPTQ: row-wise Hessian updates\n(2nd order, expensive but accurate)"]
Q2["SmoothQuant: activation smoothing\nfor outlier-safe quantization"]
Q3["QuIP: incoherence processing\n(Hadamard randomization)"]
end
subgraph SVDComp["SVD-Based Compression"]
S1["ASVD: one-sided (diagonal)"]
S2["SVD-LLM: one-sided (Cholesky)"]
S3["IO-SVD: two-sided (KL)\n+ adaptive rank + remapping"]
end
subgraph Pruning["Structured Pruning"]
P1["LLM-Pruner: neuron connectivity\n(structured, hardware-friendly)"]
P2["SliceGPT: PCA-based slice removal\n(reduces all matrix dimensions)"]
P3["Wanda: magnitude × activation\n(unstructured)"]
end
subgraph Hybrid["Hybrid Approaches"]
H1["Dobi-SVD: SVD + quantization remapping\n(gradient-optimized)"]
H2["IO-SVD‡: SVD + loss-aware int8\n(this paper's hybrid mode)"]
end
S2 --> S3
Q1 --> H1
H1 --> H2
Key positioning: IO-SVD sits at the intersection of SVD compression and hybrid SVD-quantization methods. It does not require specialized hardware support (unlike quantization, which needs INT8/INT4 kernels) — low-rank matrix multiplication works on standard CUDA cores. But the loss-aware remapping variant adds optional INT8 for rows with low quantization sensitivity.
When to choose SVD over quantization?
- Hardware without INT4/INT8 kernel support: SVD works with standard FP16 GEMM
- When you need structured parameter reduction (reducing actual matrix rank, enabling smaller KV cache)
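- When calibration data is limited: SVD compression here calibrates on 256 sequences. Note that IO-SVD adds backward passes to accumulate $C_\ell$, so its calibration cost is higher than input-only whitening; no wall-clock comparison against GPTQ is reported in this review.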
Additional Experimental Details
LLaMA-2-7B Commonsense Reasoning
Table 5 compares IO-SVD against both structured pruning and SVD methods on LLaMA-2-7B:
All compressed rows are at 40% retention. Rows marked ∗ or ‡ include remapping.

| Method | PIQA | HellaS. | WinoG. | ARC-e | ARC-c | Avg |
|---|---|---|---|---|---|---|
| Baseline (FP16) | 0.78 | 0.57 | 0.69 | 0.76 | 0.43 | 0.65 |
| LLM-Pruner | 0.70 | 0.41 | 0.53 | 0.53 | 0.27 | 0.48 |
| SliceGPT | 0.65 | 0.57 | 0.60 | 0.43 | 0.32 | 0.51 |
| Bonsai | 0.72 | 0.45 | 0.58 | 0.59 | 0.30 | 0.53 |
| Wanda-sp | 0.70 | 0.42 | 0.53 | 0.57 | 0.29 | 0.50 |
| SVD-LLM | 0.56 | 0.30 | 0.57 | 0.39 | 0.21 | 0.41 |
| ZS-SVD | 0.63 | 0.34 | 0.60 | 0.46 | 0.25 | 0.45 |
| IO-SVD | 0.61 | 0.33 | 0.59 | 0.51 | 0.23 | 0.45 |
| Dobi-SVD∗ (+ remapping) | 0.72 | 0.45 | 0.64 | 0.67 | 0.31 | 0.56 |
| ZS-SVD∗ (+ remapping) | 0.72 | 0.46 | 0.67 | 0.66 | 0.33 | 0.57 |
| IO-SVD‡ (+ remapping) | 0.74 | 0.47 | 0.67 | 0.73 | 0.38 | 0.60 |
Several important points:
- Structured pruning (LLM-Pruner, Bonsai) beats SVD without remapping at 40% retention: 0.53 for Bonsai vs. 0.41 for SVD-LLM and 0.45 for IO-SVD. A likely reason is that structured pruning removes entire attention heads or neurons, maintaining full-rank computation in surviving components.
- IO-SVD‡ has the best average at 40% retention: 0.60, surpassing even the best structured pruning baseline (Bonsai 0.53). It is not best on every task: SliceGPT is higher on HellaSwag (0.57 vs. 0.47).
- The ARC-Challenge results tell the most interesting story: this is the hardest subset (requires multi-step reasoning), and remapping recovers most of the lost accuracy here (SVD-LLM: 0.21 → IO-SVD‡: 0.38, against a 0.43 baseline).
Remapping Ablation (Table 6)
The remapping comparison on LLaMA-7B isolates the contribution of loss-aware row selection:
Perplexity (lower = better) at maintenance ratios 0.8 and 0.6.

| Method | Mode | Wiki (0.8) | C4 (0.8) | PTB (0.8) | Wiki (0.6) | C4 (0.6) | PTB (0.6) |
|---|---|---|---|---|---|---|---|
| SVD-LLM | compressed | 7.94 | 15.84 | 16.22 | 13.11 | 49.83 | 63.75 |
| SVD-LLM | + remap∗ | 5.86 | 7.82 | 8.82 | 6.98 | 11.59 | 12.88 |
| SVD-LLM | + loss-aware‡ | 5.66 | 7.78 | 8.71 | 6.69 | 11.39 | 12.46 |
| ZS-SVD | compressed | 6.74 | 10.74 | 11.87 | 11.44 | 34.13 | 43.19 |
| ZS-SVD | + remap∗ | 5.90 | 7.95 | 8.81 | 6.96 | 11.52 | 12.72 |
| ZS-SVD | + loss-aware‡ | 5.69 | 7.92 | 8.78 | 6.69 | 11.46 | 12.80 |
| IO-SVD | compressed | 6.41 | 9.82 | 10.93 | 9.84 | 27.15 | 28.84 |
| IO-SVD | + remap∗ | 5.76 | 7.61 | 8.59 | 6.48 | 10.24 | 10.95 |
| IO-SVD | + loss-aware‡ | 5.59 | 7.62 | 8.56 | 6.27 | 10.15 | 10.89 |
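The key insight: standard remapping (∗) gives the large gain (e.g., IO-SVD: 6.41 → 5.76 at 0.8 ratio), while loss-aware remapping (‡) gives an additional marginal improvement (5.76 → 5.59). In this table the loss-aware gain is most consistent on WikiText2, the calibration distribution (roughly 0.2 to 0.3 PPL for every method at both ratios). On C4 and PTB it is smaller and mixed: IO-SVD moves from 7.61 to 7.62 on C4 at 0.8, and ZS-SVD gets slightly worse on PTB at 0.6 (12.72 → 12.80). The one exception is SVD-LLM at 0.6, where PTB improves by 0.42. So these numbers do not clearly show that loss-aware selection is more robust to distribution shift; its benefit is mainly visible in-distribution.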
Theoretical Connections: Why Doubly-Whitened SVD Approximates the Optimal
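The Eckart–Young–Mirsky theorem gives the optimal low-rank approximation under the Frobenius norm. Writing $G$ for the output-side curvature and $C$ for the input second-moment matrix (both damped, hence invertible; the symbols are shorthand for this section), IO-SVD reduces the problem to:

$$\min_{\operatorname{rank}(\hat{W}) \le r} \; \big\| G^{1/2} (W - \hat{W})\, C^{1/2} \big\|_F^2$$

This is equivalent (by the substitution $M = G^{1/2} W C^{1/2}$, $\hat{M} = G^{1/2} \hat{W} C^{1/2}$) to:

$$\min_{\operatorname{rank}(\hat{M}) \le r} \; \big\| M - \hat{M} \big\|_F^2$$

The solution is the rank-$r$ truncated SVD of $M$, mapped back as $\hat{W} = G^{-1/2} \hat{M}\, C^{-1/2}$, which is globally optimal under this objective.
What the objective approximates: Under the moment-decoupling and Taylor approximations:

$$\Delta \mathrm{KL} \;\approx\; \tfrac{1}{2}\, \big\| G^{1/2} (W - \hat{W})\, C^{1/2} \big\|_F^2$$

So minimizing this Frobenius norm is equivalent to minimizing the (approximate) layerwise KL divergence increase. IO-SVD is therefore minimizing a second-order approximation to the compression-induced KL divergence, which is a well-motivated layerwise compression objective.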
The approximations involved are:
- Layerwise independence (compress each layer independently, ignoring cross-layer effects)
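- Second-order Taylor (ignores terms of third and higher order in the weight perturbation)
- Moment decoupling (the expectation of the product of input statistics and output-side curvature is replaced by the product of their expectations)
- Top-K vocabulary restriction (ignores the long tail of the output token distribution)
Each approximation is familiar from the literature on second-order methods. The combination makes the method practical while retaining most of the theoretical grounding.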
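The reduction to a plain truncated SVD can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: `g_out` and `c_in` are random damped SPD matrices standing in for the real output-side curvature and input second-moment estimates, and all function names are hypothetical. It compares the doubly-whitened solution against vanilla truncated SVD under the same weighted norm.

```python
import numpy as np


def sym_sqrt(mat):
    """Symmetric square root and inverse square root of an SPD matrix."""
    vals, vecs = np.linalg.eigh(mat)
    root = (vecs * np.sqrt(vals)) @ vecs.T
    inv_root = (vecs / np.sqrt(vals)) @ vecs.T
    return root, inv_root


def truncated_svd(mat, rank):
    """Best rank-`rank` approximation in Frobenius norm (Eckart-Young-Mirsky)."""
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


def doubly_whitened_svd(weight, g_out, c_in, rank):
    """Rank-`rank` minimizer of ||G^(1/2) (W - W_hat) C^(1/2)||_F."""
    g_root, g_inv = sym_sqrt(g_out)
    c_root, c_inv = sym_sqrt(c_in)
    whitened = g_root @ weight @ c_root          # M = G^(1/2) W C^(1/2)
    return g_inv @ truncated_svd(whitened, rank) @ c_inv


def weighted_error(weight, approx, g_out, c_in):
    """Weighted Frobenius error used as the layerwise objective."""
    g_root, _ = sym_sqrt(g_out)
    c_root, _ = sym_sqrt(c_in)
    return np.linalg.norm(g_root @ (weight - approx) @ c_root)


rng = np.random.default_rng(0)
m, n, rank = 12, 16, 4
weight = rng.standard_normal((m, n))
a = rng.standard_normal((m, m))
b = rng.standard_normal((n, n))
g_out = a @ a.T + 1e-2 * np.eye(m)   # stand-in curvature, damped to stay invertible
c_in = b @ b.T + 1e-2 * np.eye(n)    # stand-in input second moment, damped

w_io = doubly_whitened_svd(weight, g_out, c_in, rank)
w_plain = truncated_svd(weight, rank)
err_io = weighted_error(weight, w_io, g_out, c_in)
err_plain = weighted_error(weight, w_plain, g_out, c_in)
print(f"doubly-whitened: {err_io:.4f}  vanilla SVD: {err_plain:.4f}")
```

Because vanilla truncated SVD is just another feasible rank-4 candidate, its weighted error can never be lower than the doubly-whitened one; how large the gap is depends on how anisotropic the two curvature factors are.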
Relationship to LoRA and Fine-Tuning Recovery
One natural question: does IO-SVD’s better initialization from double-sided whitening translate into better outcomes when combined with LoRA fine-tuning recovery?
SVD-LLM introduced a “LoRA recovery” step: after SVD compression, add a low-rank adapter and fine-tune it on a small dataset to recover quality. The idea is that the compressed model provides a good starting point, and the adapter fills in the residual error.
The starting point quality hypothesis: If the compressed model is already closer to the original (lower KL divergence), then:
- The residual error is smaller in magnitude and more distributed across less sensitive directions
- LoRA needs fewer steps and less capacity to recover the same quality
- Final quality (compressed + adapter) should be higher
IO-SVD’s Table 1 results at 80% retention (5.59 PPL with remapping, vs. SVD-LLM’s ~5.66 with LoRA recovery) suggest that IO-SVD without fine-tuning can match SVD-LLM with fine-tuning. This is a significant practical advantage: fine-tuning requires additional training data and compute, while IO-SVD needs only a calibration pass.
What if you combine IO-SVD with LoRA? The paper doesn’t explore this, but one would expect the combination to achieve the best results. Starting from a better initialization (IO-SVD) and then applying a small LoRA adapter should outperform both alternatives. The interesting research question is: at what compression ratio does the LoRA adapter stop helping? Intuitively, if too many singular values are removed, no amount of low-rank residual tuning can recover the lost expressivity.
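A note on the search space: IO-SVD produces a factorized weight $\hat{W}$ for each $m \times n$ matrix, which is explicitly low-rank (rank $r$). Adding a rank-$r'$ LoRA adapter gives an approximation of rank at most $r + r'$. The total parameter count is $(r + r')(m + n)$. Choosing $r$ and $r'$ jointly would allow the budget to be split deliberately between the compressed backbone and the adapter.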
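The trade-off between backbone rank and adapter rank is simple arithmetic, sketched below in plain Python. The helper names and the example shape (a 4096 × 4096 projection at a 0.4 storage budget) are assumptions chosen for illustration; the paper does not study this combination.

```python
def factor_params(m, n, rank):
    """Parameters stored by a rank-`rank` factorization of an m x n matrix."""
    return rank * (m + n)


def backbone_plus_adapter(m, n, r, r_adapter):
    """Total parameters for a rank-r backbone plus a rank-r_adapter LoRA update."""
    return factor_params(m, n, r) + factor_params(m, n, r_adapter)


def splits_under_budget(m, n, budget_ratio):
    """All (r, r_adapter) pairs that use the full rank budget of budget_ratio * m * n."""
    max_total_rank = int(budget_ratio * m * n // (m + n))
    return [(r, max_total_rank - r) for r in range(max_total_rank + 1)]


splits = splits_under_budget(4096, 4096, 0.4)
r, r_adapter = splits[800]
print(r, r_adapter, backbone_plus_adapter(4096, 4096, r, r_adapter))
```

Every pair returned by `splits_under_budget` costs the same storage, so the open question is purely which split gives the best quality after recovery.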
Relationship to KV Cache Management in Modern LLM Serving
IO-SVD’s KV cache compression idea connects to a broader trend in LLM serving:
graph LR
subgraph KVCacheApproaches["KV Cache Size Reduction Approaches"]
MQA["Multi-Query Attention (MQA)\nShare K/V across heads\n(1 head for K/V, H for Q)"]
GQA["Grouped Query Attention (GQA)\nShare K/V in groups\n(H/G heads per K/V group)"]
MLA["Multi-head Latent Attention (MLA)\nProject K/V to low-dim latent z\nthen expand with shared U_K, U_V"]
IOSVD["IO-SVD KV Compression\nCache low-dim z_t = D^T x_t\n(r << d_head, learned compression)"]
end
MQA -->|"generalized to"| GQA
GQA -->|"further: low-rank"| MLA
MLA -->|"post-training"| IOSVD
MQA/GQA reduce KV cache by sharing K/V heads across groups of query heads, a choice that is architecturally baked in at training time. MLA (introduced in DeepSeek-V2 and also used in DeepSeek-V3) goes further with learned low-rank projections, also at training time. IO-SVD’s KV cache compression achieves a similar effect post-training: by caching only the low-rank latent instead of the full key/value, it effectively “converts” a dense K/V projection to a latent attention mechanism, without any retraining.
The key difference from MLA is that IO-SVD derives the compression matrices from the existing weight matrices via SVD + whitening, rather than learning them end-to-end. This makes it applicable to any pre-trained model without modifying the training recipe.
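Why caching the latent is lossless once the projection is already low-rank can be shown in a few lines of NumPy. This is a toy single-head sketch under stated assumptions: the key projection is exactly `up @ down.T` (names borrowed loosely from the diagram's `z_t = D^T x_t`), positional encodings such as RoPE are ignored, and the matrices are random rather than derived from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, rank, seq_len = 64, 16, 4, 10

down = rng.standard_normal((d_model, rank))    # D: maps x_t to the latent z_t
up = rng.standard_normal((d_head, rank))       # U: expands z_t back to a key
xs = rng.standard_normal((seq_len, d_model))   # hidden states x_t for cached tokens
query = rng.standard_normal(d_head)            # current decoding query

# Option A: cache full keys k_t = U D^T x_t (seq_len x d_head values).
full_keys = xs @ down @ up.T
scores_full = full_keys @ query

# Option B: cache latents z_t = D^T x_t (seq_len x rank values)
# and fold U into the query once per decoding step.
latents = xs @ down
scores_latent = latents @ (up.T @ query)

cache_ratio = latents.size / full_keys.size
print(np.allclose(scores_full, scores_latent), cache_ratio)
```

The attention logits agree in both options while the cache shrinks by the factor `rank / d_head`; the same folding trick applies on the value side by absorbing the expansion matrix into the output projection.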
Reproducibility Notes
- Code: https://github.com/mint-vu/IO-SVD
- Calibration: 256 WikiText2 sequences, length 2048 (easy to reproduce)
- Hardware used: NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB) — not yet commodity hardware, but A100/H100 80 GB should work for 7-13B models
- Key hyperparameters: the damping constants, the top-K size for the KL curvature, and the minimum rank ratio
- Top-K selection: the optimal K is determined by a sweep on the WikiText2 validation set; it generalizes to PTB/C4
- The moment-decoupling approximation (treating the input activations and the output-side curvature as independent) may not hold when the model attends differently across very diverse inputs; worth checking on domain-specific calibration data
Personal Analysis
IO-SVD represents a clean synthesis of ideas that were floating around the SVD compression literature: KL-aware objectives (from information geometry), Kronecker-factored curvature (from K-FAC), and greedy component removal (from optimal brain damage). The key novelty is the efficient computation of the output-side curvature via top-K VJPs — this avoids the vocabulary-size bottleneck that would otherwise make double-sided whitening impractical.
The most interesting result to me is the heterogeneous rank allocation being the dominant contributor (Table 4). This suggests that future work could focus on even better rank allocation strategies — perhaps second-order corrections or learned allocation policies — while the whitening objective is already “good enough.” The paper’s ablation is methodologically sound in isolating these contributions.
One tension I notice: the paper calibrates on WikiText2 (in-distribution for the perplexity evaluation), and the top-K curvature selection is also tuned on WikiText2. For deployment on specialized domains (medical, code, legal), the calibration distribution mismatch could be significant. A natural extension would be to study how different calibration sets affect the quality of the estimated curvature statistics and the resulting allocation.
The KV-cache compression as a byproduct (Section 4.2.1) is underemphasized in my opinion. Achieving 4.34× throughput and dropping from 77.6 GB to 23.1 GB peak memory with minimal quality loss is the kind of result that actually enables deployment on mid-tier hardware. This deserves a dedicated experiment varying sequence length and batch size to characterize the memory-bandwidth tradeoff more fully.
Overall, IO-SVD is a solid step toward principled, information-geometry-aware LLM compression, and the combination of all three components (double-sided whitening + heterogeneous rank allocation + loss-aware remapping) sets a strong new baseline for the field.
Comparison with MLA-style latent attention: DeepSeek V3’s Multi-head Latent Attention (MLA) also uses low-rank KV projections, but as a training-time architectural choice. IO-SVD achieves a similar effect post-training, demonstrating that the “low-rank KV” idea is not just architecturally motivated but can also be retrofitted. This convergence of ideas from different angles (training-time MLA vs. post-training SVD) suggests that low-rank KV representations are a robust and general principle.
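On reproducibility at scale: The calibration process accumulates two curvature matrices (one input-side, one output-side) per weight matrix. For a 7B model with 32 layers × 7 weight matrices each, that’s 224 weight matrices and 448 curvature matrices. Treating each as $d \times d$ with $d = 4096$ in fp32, each is $4096^2 \times 4$ bytes $= 64$ MB, for a total of ~28 GB just for curvature matrices. For 70B models ($d = 8192$, 80 layers, 7 matrices each), the same estimate gives 1120 matrices of 256 MB each, roughly 280 GB, and more once the wider MLP dimensions are counted. That already exceeds the memory of a single GPU if all matrices are held at once. Practical deployment at scale would require curvature matrix compression, block-diagonal approximations, or streaming estimation, all active research directions in second-order optimization.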
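The memory estimate is easy to reproduce and vary. The helper below is a back-of-the-envelope sketch, assuming every curvature matrix is a dense `hidden_dim × hidden_dim` fp32 array and that all of them are resident at once; both are simplifications of this review, not statements about the released code.

```python
def curvature_memory_gib(hidden_dim, num_layers, mats_per_layer=7,
                         curvatures_per_mat=2, bytes_per_elem=4):
    """Memory (GiB) for dense d x d curvature matrices held simultaneously."""
    num_curvatures = num_layers * mats_per_layer * curvatures_per_mat
    total_bytes = num_curvatures * hidden_dim ** 2 * bytes_per_elem
    return total_bytes / 2 ** 30


mem_7b = curvature_memory_gib(hidden_dim=4096, num_layers=32)
mem_70b = curvature_memory_gib(hidden_dim=8192, num_layers=80)
print(f"7B-scale: {mem_7b:.1f} GiB, 70B-scale: {mem_70b:.1f} GiB")
```

Changing `bytes_per_elem` to 8 models fp64 accumulation, and processing one transformer block at a time would divide the resident footprint by `num_layers`.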
A thought experiment: What if IO-SVD were applied not to all linear layers uniformly, but selectively — compressing only the layers identified as least sensitive? Combined with a heterogeneous rank allocation that leaves some layers fully dense, this “sparse SVD” approach might recover even more quality at the same storage budget. The current framework already supports this (layers at or above their break-even rank are kept dense), but the question of which layers to exclude entirely deserves explicit study.
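The break-even rule mentioned here follows from counting parameters: a rank-r factorization of an m × n matrix stores r(m + n) values, so it only saves storage while r is below mn / (m + n). A minimal sketch, with hypothetical helper names and ignoring any int8 remapping (which would change the storage accounting):

```python
def break_even_rank(m, n):
    """Rank at which a factorized m x n layer stores as many values as the dense one."""
    return m * n / (m + n)


def keep_dense(m, n, allocated_rank):
    """True when the allocated rank no longer saves storage over the dense matrix."""
    return allocated_rank >= break_even_rank(m, n)


print(break_even_rank(4096, 4096))      # square attention projection
print(keep_dense(4096, 11008, 2985))    # LLaMA-7B-sized MLP projection
print(keep_dense(4096, 11008, 2986))
```

A selective scheme would apply this check after rank allocation and skip factorization entirely for any layer where it returns `True`.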
Final verdict: IO-SVD represents the current best practice in post-training SVD-based LLM compression. For practitioners: if you need to deploy a 7B model on hardware where the uncompressed model barely fits, IO-SVD + KV-cache compression can give you 3-4× throughput and substantially lower memory footprint with sub-1 PPL quality loss at 80% retention — a compelling practical trade-off.
Open Questions and Future Directions
Several threads from this work deserve follow-up:
- Second-order rank allocation: The greedy first-order score works well but is a proxy for the true loss impact. Including the diagonal Hessian entry (analogous to OBD) could improve allocation accuracy, especially at aggressive compression ratios where first-order approximations degrade.
- Domain-adaptive calibration: All experiments use WikiText2 for calibration. The top-K curvature approximation is tuned on this domain. For specialized deployment (medical, legal, code), calibration with domain-specific data would better characterize the curvature statistics, potentially yielding better rank allocation for in-domain tasks.
- Joint architecture search: IO-SVD currently compresses each layer independently after the model is trained. A natural extension is to train with SVD-structured weights from the start, jointly learning the whitening matrices and rank distribution via gradient descent, analogous to how MLA is trained end-to-end.
- Multi-GPU disaggregated KV: For serving, low-rank KV latents are even more attractive in disaggregated architectures where KV caches are stored remotely (like Mooncake’s transfer engine). The smaller latent size reduces network transfer bandwidth between prefill and decode workers.
- Extension to convolution and SSM layers: The doubly-whitened SVD framework applies to any linear map. State-space and linear-recurrent models (Mamba, RWKV) have their own analog of weight matrices; adapting IO-SVD’s whitening to their recurrence structure would be a natural generalization.
- Quantization-aware joint optimization: Currently IO-SVD performs SVD truncation first, then remapping. A joint formulation that simultaneously decides rank and quantization targets (e.g., a mixed-integer program over {fp16, int8, int4} precision for each singular component) might find better Pareto points on the quality-compression curve.
- Online/adaptive compression: For long-context workloads where the token distribution changes significantly over the sequence, a dynamic rank adaptation strategy (increasing rank for layers that become more sensitive as context grows) could yield better quality than a fixed static allocation derived from short calibration sequences.