July 3, 2026 EN #SVD & Low-Rank #Model Compression #LLM Inference

AIR: Activation- and Influence-Aware SVD Compression for LLMs — Technical Review

Review date: 2026-07-03
Review author: Zhongzhu Zhou
Paper reviewed: Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs
Paper authors: Nico Harder et al.
arXiv: 2606.19993
Venue / Status: ICML 2026 Workshop on Resource-Adaptive Foundation Model Inference (AdaptFM), Seoul, South Korea

Short Answer

AIR (Activation- and Influence-Aware Ranks) is an SVD-based LLM weight compression method that improves upon the best purely activation-aware baseline, SVD-LLM(W), by folding a backward-signal influence term into the compression objective. The core insight is straightforward but impactful: not every weight element contributes equally to the model’s predictions, and a backward relevance propagation pass can cheaply identify which elements matter most. AIR encodes this importance as an element-wise matrix $I \in \mathbb{R}^{m \times n}$ and integrates it into the Frobenius-norm loss via a Hadamard (element-wise) weighting of the form $(1 + \delta I)$ . Because this weighted objective no longer has an analytic SVD solution, AIR solves it with a single alternating least squares (ALS) sweep whose per-component updates are closed-form, processing rank components from least-significant to most-significant and starting from the SVD-LLM(W) solution. Proposition 3.1 guarantees monotone descent: the sweep can never raise the weighted objective above its value at the activation-aware initialization.

The empirical payoff is substantial. On LLaMA-7B with WikiText-2 evaluation, AIR improves perplexity over SVD-LLM(W) by 4.6 pct at 80 pct parameter retention, 18.4 pct at 60 pct, 33.4 pct at 40 pct, and 44.7 pct at 20 pct. On the C4 benchmark the gains are even larger (14.5–70.4 pct), suggesting better out-of-distribution generalization. Translated to hardware, 60 pct parameter retention yields a 64 pct peak-memory reduction and a 53 pct per-token latency cut on an A100 40GB. AIR requires only ~12 minutes of extra calibration time on LLaMA-7B and matches SVD-LLM(W) quality with approximately 10× fewer calibration samples. When combined with LoRA fine-tuning, AIR+LoRA surpasses ACIP — a costlier end-to-end optimization competitor — at every retention rate. A key ablation shows that the specific backward signal chosen (LRP, Weight×Gradient, Fisher) barely matters; what matters is that the signal be integrated element-wise, not aggregated row-wise.

In one sentence: AIR adds a closed-form, provably non-degrading element-wise backward-signal correction sweep on top of activation-aware SVD initialization, buying large perplexity gains at low cost and composing cleanly with LoRA fine-tuning.

Prerequisites

Singular Value Decomposition and the Eckart-Young Theorem

The Singular Value Decomposition (SVD) is the mathematical foundation of low-rank compression. Any real matrix $W \in \mathbb{R}^{m \times n}$ can be exactly decomposed as

W = U \Sigma V^\top

where $U \in \mathbb{R}^{m \times m}$ is an orthogonal matrix ( $U^\top U = I$ ), $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix with non-negative entries $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0$ called singular values, and $V \in \mathbb{R}^{n \times n}$ is orthogonal. The columns of $U$ and $V$ are the left and right singular vectors, respectively. Each outer product $\sigma_i u_i v_i^\top$ is a rank-1 matrix, so the full SVD expresses $W$ as a sum of $\min(m,n)$ rank-1 matrices weighted by decreasing singular values.

The rank- $k$ truncated approximation keeps only the top $k$ terms:

W_k = U_k \Sigma_k V_k^\top = \sum_{i=1}^{k} \sigma_i \, u_i v_i^\top

where $U_k \in \mathbb{R}^{m \times k}$ , $\Sigma_k \in \mathbb{R}^{k \times k}$ (diagonal), $V_k \in \mathbb{R}^{n \times k}$ .

The Frobenius norm of a matrix $M$ is the square root of the sum of squared entries:

\| M \|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} M_{ij}^2}

It is the element-wise analogue of the Euclidean norm for vectors. For SVD, the residual Frobenius norm after rank- $k$ truncation equals

\| W - W_k \|_F^2 = \sum_{i=k+1}^{\min(m,n)} \sigma_i^2

The Eckart–Young–Mirsky theorem (1936) states that the rank- $k$ SVD truncation is the globally optimal rank- $k$ approximation in the Frobenius norm:

W_k = \arg\min_{\text{rank}(A) \leq k} \| W - A \|_F^2

No rank- $k$ matrix achieves a smaller Frobenius distance from $W$ than $W_k$ . This is a strong and elegant guarantee: the greedy singular-value truncation is globally optimal.
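A small numerical check can make the two identities above concrete. The sketch below uses NumPy on a random matrix (the sizes and the seed are arbitrary choices for illustration) and compares the rank- $k$ residual against the tail singular values, then against one random rank- $k$ competitor.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
k = 2

# Thin SVD: s holds the singular values in descending order.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Rank-k truncation: keep the top-k singular triplets.
W_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Squared Frobenius residual vs. sum of the discarded squared singular values.
residual_sq = np.linalg.norm(W - W_k, "fro") ** 2
tail_sq = np.sum(s[k:] ** 2)

# Any other matrix of rank <= k should do no better than the truncation.
A = rng.standard_normal((6, k)) @ rng.standard_normal((k, 4))
competitor_sq = np.linalg.norm(W - A, "fro") ** 2

print(residual_sq, tail_sq, competitor_sq)
```

The first two printed numbers should agree up to floating-point error, and the third should not be smaller than either.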

Why vanilla SVD fails for LLM compression. The Frobenius norm treats all elements of $W$ with equal weight. But weight elements connected to rarely-activated input directions contribute almost nothing to the layer’s output regardless of their magnitude. Conversely, small weights in high-activation directions can be critical. The Eckart-Young optimum for the plain Frobenius objective is therefore not aligned with preserving model function. This is the central gap that activation-aware and influence-aware methods seek to close.

Low-Rank Approximation and LLM Compression

Modern LLMs are composed almost entirely of linear layers: attention projections ( $W_Q, W_K, W_V, W_O$ ) and feedforward layers ( $W_{up}, W_{gate}, W_{down}$ ). These matrices are individually large — for LLaMA-7B, dimensions of 4096×4096 or 4096×11008 are common — but empirically exhibit fast singular value decay, indicating that the effective rank is much lower than the nominal dimension.

When we compress a weight $W \in \mathbb{R}^{m \times n}$ to rank $k$ , storing the two factors $U_k \in \mathbb{R}^{m \times k}$ and $V_k^\top \in \mathbb{R}^{k \times n}$ requires $k(m+n)$ parameters instead of $mn$ . The parameter retention rate is

r = \frac{k(m+n)}{mn} = k\left(\frac{1}{n} + \frac{1}{m}\right)

For a square matrix ( $m = n$ ), retaining 60 pct of parameters means $k = 0.3n$ .

At inference, the compressed matrix-vector product splits into two smaller ones:

y = W_k x = U_k (V_k^\top x)

The first operation $V_k^\top x$ costs $kn$ multiply-accumulates (MACs), i.e. $2kn$ FLOPs, and $U_k (\cdot)$ costs $km$ MACs ( $2km$ FLOPs), for a total of $2k(m+n)$ FLOPs versus the original $2mn$ . The FLOP saving ratio is $1 - k(1/n + 1/m)$ , which equals the parameter saving ratio for any $m$ and $n$ , not only for square matrices. This computational saving carries over to inference latency, peak memory, and throughput, which is what makes SVD-based compression attractive for deployment.
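The parameter and compute counts above are simple enough to script. The helpers below are a minimal sketch; the function names are hypothetical and chosen only for this illustration, and the rank is rounded down so the retention target is never exceeded.

```python
def retention_rate(m, n, k):
    """Fraction of parameters kept when an m x n matrix is stored as two rank-k factors."""
    return k * (m + n) / (m * n)


def rank_for_retention(m, n, target):
    """Largest rank k whose retention rate does not exceed `target`."""
    return int(target * m * n / (m + n))


def macs_per_token(m, n, k=None):
    """Multiply-accumulates for one matrix-vector product (dense if k is None)."""
    return m * n if k is None else k * (m + n)


# Example with two LLaMA-7B-like shapes mentioned in the text.
for m, n in [(4096, 4096), (4096, 11008)]:
    k = rank_for_retention(m, n, 0.60)
    ratio = macs_per_token(m, n, k) / macs_per_token(m, n)
    print(m, n, k, retention_rate(m, n, k), ratio)
```

For each shape, the last two printed columns coincide, which is the statement that parameter retention and MAC retention are the same quantity.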

Activation-Aware Compression: ASVD and SVD-LLM

Pure Frobenius-optimal compression ignores how the weight matrix is actually used during inference. The activation-aware perspective corrects this by asking: which directions in the input space carry real signal, and how much do errors in each direction affect the layer’s output?

ASVD (Activation-aware SVD) addresses this with per-channel input scaling. For each input channel $j$ , it computes a scaling factor proportional to the root-mean-square of that channel’s activations over calibration data, rescales the input, performs SVD, and rescales back. This is equivalent to a diagonal approximation of the activation covariance.

SVD-LLM(W) takes the full activation covariance approach. Given calibration data $\mathcal{D}_{cal}$ , it collects the hidden state matrix $X \in \mathbb{R}^{n \times T}$ (hidden dimension × number of tokens) and forms the activation covariance:

\Sigma_{\mathcal{D}_{cal}} = \sum_{d \in \mathcal{D}_{cal}} X_d X_d^\top \in \mathbb{R}^{n \times n}

The Cholesky factorization $\Sigma = S S^\top$ yields the lower-triangular profiling matrix $S \in \mathbb{R}^{n \times n}$ . The profiled weight is $W' = WS$ , and the activation-aware objective is:

\mathcal{L}_{act} = \| W' - U'_k \Sigma'_k V'^{\top}_k \|_F^2 = \| (W - W_k) S \|_F^2

By the Eckart-Young theorem, vanilla SVD applied to $W'$ minimizes this objective exactly. SVD-LLM(W) is therefore the analytically optimal solution to the activation-aware compression problem and is extremely hard to beat with local methods — until AIR.
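To see the effect of whitening on a toy problem, the sketch below compares plain truncated SVD with truncation of the profiled weight $W' = WS$ . It is an illustration under stated assumptions, not the SVD-LLM code: the activations are synthetic, the sizes are tiny, and the channel scales are made up to create an anisotropic covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, T, k = 8, 6, 200, 3
W = rng.standard_normal((m, n))

# Synthetic activations: some input channels are much louder than others.
channel_scale = np.array([5.0, 3.0, 1.0, 0.3, 0.1, 0.05])
X = rng.standard_normal((n, T)) * channel_scale[:, None]

Sigma = X @ X.T                     # activation covariance (n x n)
S = np.linalg.cholesky(Sigma)       # lower-triangular, S @ S.T == Sigma


def truncated_svd(M, rank):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]


def output_error(W_hat):
    """Squared output error summed over all calibration tokens."""
    return np.linalg.norm((W - W_hat) @ X, "fro") ** 2


W_plain = truncated_svd(W, k)

# Truncate in the whitened space, then undo the whitening:
# M @ inv(S) is computed as solve(S.T, M.T).T to avoid an explicit inverse.
W_white = np.linalg.solve(S.T, truncated_svd(W @ S, k).T).T

whitened_loss = np.linalg.norm((W - W_white) @ S, "fro") ** 2
print(output_error(W_plain), output_error(W_white), whitened_loss)
```

The last two numbers should match, which is the identity $\| (W - W_k) S \|_F^2 = \| (W - W_k) X \|_F^2$ , and the whitened truncation should have the lower output error of the two.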

Backward-Signal Methods: LRP, Weight×Gradient, and Fisher Information

A complementary line of inquiry asks: which weight elements most influence the final prediction? Three backward-signal approaches exist:

Weight × Gradient (W×G): From the first-order Taylor expansion of the loss $\mathcal{L}$ at perturbation $\delta W$ :

\mathcal{L}(W + \delta W) \approx \mathcal{L}(W) + \langle \nabla_W \mathcal{L},\ \delta W \rangle + O(\|\delta W\|^2)

The element-wise sensitivity is $|W_{ij} \cdot (\nabla_W \mathcal{L})_{ij}|$ . This “saliency” score measures the first-order impact of removing each weight on the loss. It is cheap to compute: one backward pass through the model, element-wise multiplication, done.

Fisher Information: The second-order approach uses the empirical Fisher matrix as a curvature estimate. FWSVD computes a per-row Fisher by summing $(\nabla_{W_{i\cdot}} \mathcal{L})(\nabla_{W_{i\cdot}} \mathcal{L})^\top$ over calibration samples. This row-wise aggregation collapses the column dimension — a key limitation.

Layer-wise Relevance Propagation (LRP, ε-rule): LRP is a backpropagation variant that propagates “relevance” (rather than gradient) backward through the network. For a linear layer with pre-activations $z_j = \sum_i a_i W_{ij} + b_j$ and output relevances $\{R_j\}$ , the ε-LRP rule assigns input relevances as:

R_i^{(\ell)} = \sum_j \frac{a_i W_{ij}}{z_j + \epsilon \cdot \text{sign}(z_j)} \, R_j^{(\ell+1)}

with $\epsilon = 10^{-6}$ for numerical stability. The key advantage is that LRP initialized at the network output (e.g., $R_j = f(x)_j$ for the output logits) and propagated to layer $\ell$ ’s weight matrix yields a per-weight relevance matrix $\tilde{I}(d) \in \mathbb{R}^{m \times n}$ for each sample $d$ . This preserves the full element-wise spatial structure, which is precisely what row-wise Fisher destroys.

AttnLRP extends LRP to transformer attention operations, handling the softmax and product-attention nonlinearities via adapted propagation rules. AIR uses AttnLRP with $\epsilon = 10^{-6}$ to compute element-wise influence for each weight matrix in every transformer layer.
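The ε-rule for a single linear layer fits in a few lines of NumPy. The sketch below is only an illustration of the formula above, not of AttnLRP: it follows the formula's indexing, so `W` has shape (inputs, outputs), and it treats the per-pair term $a_i W_{ij} R_j / (z_j + \epsilon\,\text{sign}(z_j))$ as the per-weight relevance, which is an assumption made here for illustration. The numbers are hand-picked toy values.

```python
import numpy as np


def lrp_epsilon_linear(a, W, b, R_out, eps=1e-6):
    """Epsilon-LRP through z = a @ W + b. Returns (input relevance, per-weight relevance)."""
    z = a @ W + b                            # z_j = sum_i a_i W_ij + b_j
    sign = np.where(z >= 0, 1.0, -1.0)       # sign(0) treated as +1 so the stabilizer never vanishes
    scaled = R_out / (z + eps * sign)        # R_j / (z_j + eps * sign(z_j))
    R_weight = (a[:, None] * W) * scaled[None, :]   # contribution of each (i, j) pair
    R_in = R_weight.sum(axis=1)              # R_i = sum_j contribution_ij
    return R_in, R_weight


a = np.array([1.0, 2.0, -1.0])
W = np.array([[0.5, -1.0],
              [1.0, 0.5],
              [0.25, 2.0]])
b = np.zeros(2)
R_out = np.array([1.0, 0.5])

R_in, R_weight = lrp_epsilon_linear(a, W, b, R_out)
print(R_in, R_in.sum(), R_out.sum())
```

With zero bias, the total relevance is conserved up to the small ε term, so the two sums printed at the end should nearly coincide.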

Alternating Least Squares (ALS)

ALS is a classical optimization technique for problems that are jointly non-convex but convex in each variable when the others are fixed. Given the factorization problem

\min_{U, V} f(U, V)

where $f$ is convex in $U$ given $V$ and convex in $V$ given $U$ , ALS alternates:

Fix $V$ , solve globally optimal $U^* = \arg\min_U f(U, V)$
Fix $U^*$ , solve globally optimal $V^* = \arg\min_V f(U^*, V)$
Repeat until convergence

Each step is a convex subproblem with a closed-form solution via weighted least squares. Since each step globally minimizes its subproblem, the objective is monotonically non-increasing. In AIR, the ALS subproblems are weighted least squares, where the weights are determined by the element-wise influence matrix $I$ . The specific structure of AIR’s objective makes each rank-wise subproblem individually solvable in closed form — a key technical contribution.
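The alternation described above can be sketched for the plain (unweighted) factorization $\min_{U,V} \| M - U V^\top \|_F^2$ . This is generic textbook ALS, not the AIR update; the matrix, rank, and iteration count are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((10, 8))
k = 3

U = rng.standard_normal((10, k))
V = rng.standard_normal((8, k))


def objective(U, V):
    return np.linalg.norm(M - U @ V.T, "fro") ** 2


history = [objective(U, V)]
for _ in range(20):
    # Fix V, solve min_U ||M - U V^T||_F^2, i.e. the least-squares system V U^T = M^T.
    U = np.linalg.lstsq(V, M.T, rcond=None)[0].T
    history.append(objective(U, V))
    # Fix U, solve min_V ||M - U V^T||_F^2, i.e. the least-squares system U V^T = M.
    V = np.linalg.lstsq(U, M, rcond=None)[0].T
    history.append(objective(U, V))

# Eckart-Young lower bound for any rank-k factorization.
s = np.linalg.svd(M, compute_uv=False)
lower_bound = np.sum(s[k:] ** 2)
print(history[0], history[-1], lower_bound)
```

Each entry of `history` should be no larger than the previous one, and no entry can fall below the Eckart-Young bound.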

Why Existing Methods Fall Short

The landscape of SVD-based compression methods can be organized along two axes:

Does the method use forward (activation) information?
Does the method use backward (loss influence) information?

Vanilla SVD: neither. ASVD and SVD-LLM(W): forward only. FWSVD: backward only (row-wise Fisher). ACIP: both, but via end-to-end learnable mask optimization.

The striking empirical failure of FWSVD (at 80 pct retention, FWSVD achieves a WikiText-2 PPL of 22026 versus 7.87 for SVD-LLM(W)) reveals that having a backward signal is not sufficient. The signal must be integrated with (a) proper element-wise granularity and (b) an activation-aware foundation. FWSVD lacks both: row-wise aggregation discards element spatial structure, and there is no activation whitening step. AIR’s design repairs precisely these two deficiencies.

ACIP is the only prior method that integrates both signals effectively, but it does so via full end-to-end optimization across layers — computationally expensive and architecturally constrained. AIR achieves comparable or better performance with a purely local, closed-form procedure.

Motivation and Contributions

The core question AIR asks is sharper than “can backward signals improve SVD compression?” The real question is: given that backward signals encode functional role information, why do methods that use them fail?

The authors’ answer identifies aggregation granularity as the culprit. FWSVD has a backward signal (Fisher information) but aggregates it row-wise, destroying the column-level information about which specific (input direction, output direction) weight interactions matter. The activation-aware methods succeed precisely because they operate at fine spatial granularity — whitening treats each input dimension individually.

AIR’s key contributions:

Contribution 1: Hybrid weighted objective (Eq. 4). Multiply the activation-whitened residual element-wise by $\sqrt{1 + \delta I}$ , where $I \in \mathbb{R}^{m \times n}$ is the element-wise influence matrix. The additive all-ones matrix $\mathbf{1}$ in $(1 + \delta I)$ ensures that low-influence elements still pay an activation-weighted penalty; they are not silently freed from the objective.

Contribution 2: Closed-form ALS sweep. The weighted objective has no analytic SVD solution, but it decomposes rank-by-rank into individual weighted least squares problems, each with a closed-form solution via simple matrix operations. A single backward sweep (from $r = k-1$ to $r = 0$ ) suffices.

Contribution 3: Monotone descent proof (Proposition 3.1). Each ALS step is globally optimal for its subproblem, so the hybrid objective cannot increase. In terms of that objective, AIR is therefore a safe upgrade over SVD-LLM(W): each step either lowers it or leaves it unchanged.

Contribution 4: Signal-agnosticism. The ablation finding that LRP, W×G, and per-element Fisher all yield essentially identical perplexity simplifies deployment. Any element-wise backward signal works. Practitioners can use W×G (one backward pass, one element-wise multiply) without implementing AttnLRP.

Contribution 5: LoRA composability. AIR+LoRA outperforms ACIP at every retention rate, demonstrating that the local compression step provides a superior initialization for fine-tuning.

Figure 1 — SVD-Based Compression Methods Landscape

flowchart TD
    ROOT["SVD-Based LLM Compression Methods"]
    ROOT --> FWD["Forward Signal Only"]
    ROOT --> BWD["Backward Signal Only"]
    ROOT --> NONE["No Signal (Frobenius)"]
    ROOT --> BOTH["Forward + Backward"]

    NONE --> VSVD["Vanilla SVD\nEckart-Young optimal on W\nPPL@80pct: 19438"]

    FWD --> ASVD["ASVD\nPer-channel diagonal whitening\nno covariance"]
    FWD --> SVDLLM["SVD-LLM(W)\nFull activation whitening via Cholesky\nBest prior local method\nPPL@80pct: 7.87"]

    BWD --> FWSVD["FWSVD\nRow-wise Fisher + SVD\nNo activation whitening\nPPL@80pct: 22026 (WORSE than vanilla!)"]

    BOTH --> ACIP["ACIP\nEnd-to-end L1-mask optimization\nHigh compute cost\nhours of calibration"]
    BOTH --> AIR["AIR (this work)\nElement-wise influence + ALS sweep\n12 min calibration\nPPL@80pct: 7.51"]

    AIR --> AIRP["AIR + LoRA\nOutperforms ACIP at all retention rates"]

Method: Activation- and Influence-Aware Ranks (AIR)

AIR operates entirely at the level of individual weight matrices. Given a weight matrix $W \in \mathbb{R}^{m \times n}$ , calibration data $\mathcal{D}_{cal}$ , a target rank $k$ , and a scalar hyperparameter $\delta \geq 0$ , it outputs factors $U_k \in \mathbb{R}^{m \times k}$ and $V_k^\top \in \mathbb{R}^{k \times n}$ for the compressed representation $W_k = U_k V_k^\top$ .

3.1 Forward Analysis: Activation Whitening

The forward analysis matches SVD-LLM(W) exactly. AIR collects hidden state vectors at the input of the target layer across all calibration tokens. Let $X_d \in \mathbb{R}^{n \times T_d}$ be the hidden state matrix for calibration sample $d$ , where $T_d$ is the number of tokens. The activation covariance is:

\Sigma_{\mathcal{D}_{cal}} = \sum_{d \in \mathcal{D}_{cal}} X_d X_d^\top \in \mathbb{R}^{n \times n}

The Cholesky factorization $\Sigma = S S^\top$ yields the lower-triangular profiling matrix $S$ . The profiled weight is then:

W' = W S \in \mathbb{R}^{m \times n} \tag{Eq. 1}

Geometrically, right-multiplying by $S$ transforms the weight space so that directions proportional to actual activation variance are amplified, while near-zero-variance directions (corresponding to rarely active input channels) are contracted. SVD on $W'$ then minimizes:

\mathcal{L}_{act} = \left\| W' - U'_k \Sigma'_k V'^{\top}_k \right\|_F^2 = \left\| (W - W_k) S \right\|_F^2

which is the squared output error summed over the calibration tokens in $\mathcal{D}_{cal}$ (the expected squared error up to normalization). By Eckart-Young, the truncated SVD of $W'$ is the globally optimal solution to $\mathcal{L}_{act}$ .

Design choice — Cholesky vs. eigendecomposition: Both produce a valid whitening transform. Cholesky is preferred because it is numerically stable for positive definite matrices, produces a square invertible $S$ needed for back-projection ( $S^{-1}$ ), and is computationally efficient. In practice, a small diagonal regularizer $\epsilon I$ is added to $\Sigma$ before Cholesky to handle near-singular covariances.

Boundary condition: If calibration data covers only part of the model’s operating distribution, $\Sigma$ underestimates variance in missing directions. Extreme OOD inputs may have higher error than predicted by $\mathcal{L}_{act}$ . This is a fundamental limitation of all activation-aware methods, including AIR.

Figure 2 — Forward and Backward Analysis Pipeline

flowchart LR
    CAL["Calibration data D_cal\n(128 samples typical)"]
    W["Weight matrix W\nin R^(m x n)"]

    subgraph FWD["Forward Analysis"]
        direction TB
        HID["Run model forward\nCollect hidden states X"]
        COV["Activation covariance\nSigma = sum(X X^T)"]
        CHOL["Cholesky: S s.t. S S^T = Sigma"]
        PROF["Profiled weight\nW' = W S  (Eq.1)"]
        HID --> COV --> CHOL --> PROF
    end

    subgraph BWD["Backward Analysis (AttnLRP)"]
        direction TB
        RELI["Initialize relevance at output\nR = f(x) per sample"]
        PROP["Propagate backward\nthrough attention + FFN"]
        ITILD["Per-weight relevance\nI_tilde(d) per sample d"]
        AGG["Aggregate + normalize\nI = sum|I_tilde(d)|, unit mean"]
        RELI --> PROP --> ITILD --> AGG
    end

    CAL --> HID
    CAL --> RELI
    W --> PROF

    PROF --> OBJ["AIR Hybrid Objective (Eq.4)\nL = || sqrt(1 + delta*I) hadamard (W' - W'_k) ||_F^2"]
    AGG --> OBJ

    OBJ --> ALS["ALS Sweep (single pass\nr = k-1 downto 0)"]
    ALS --> BACK["Back-project to native space\nU_k, V^T_k"]
    BACK --> OUT["Compressed W_k = U_k V^T_k\n(2 factor inference)"]

3.2 Backward Analysis: Element-wise Influence Matrix

While the forward pass identifies which input directions carry signal, the backward pass identifies which weight elements actually influence the final prediction. The influence matrix $I \in \mathbb{R}^{m \times n}$ captures the functional role of each individual weight element.

Connection to Taylor expansion. Consider the second-order Taylor expansion of the loss around the current weights (Eq. 2 of the paper):

\mathcal{L}(\hat{W}_k) \approx \mathcal{L}(W) + \langle \nabla_W \mathcal{L}(W),\ \hat{W}_k - W \rangle + \frac{1}{2} \text{tr}\left[ (\hat{W}_k - W)^\top H_W (\hat{W}_k - W) \right] + \ldots \tag{Eq. 2}

The first-order term $\langle \nabla_W \mathcal{L}, \hat{W}_k - W \rangle$ is bounded in magnitude by the element-wise sum $\sum_{i,j} |(\nabla_W \mathcal{L})_{ij}| \cdot |(\hat{W}_k - W)_{ij}|$ , so errors on elements with large gradient magnitude cost the most. The element-wise product $|W_{ij} \cdot (\nabla_W \mathcal{L})_{ij}|$ (the Weight × Gradient signal) is the special case in which element $(i,j)$ is removed entirely, i.e. perturbed by $-W_{ij}$ : it measures how much that weight contributes to the first-order change in loss.
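The Weight × Gradient score can be checked by hand on a toy layer. The sketch below assumes a single linear map with a squared-error loss, chosen only because its gradient has a simple analytic form; a real LLM would need an autodiff backward pass instead. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)
y = rng.standard_normal(m)


def loss(W_):
    return 0.5 * np.sum((W_ @ x - y) ** 2)


# Analytic gradient of the squared-error loss with respect to W.
grad = np.outer(W @ x - y, x)

# Element-wise Weight x Gradient saliency, same shape as W.
saliency = np.abs(W * grad)

# Remove the most salient weight and compare predicted vs. actual loss change.
i, j = np.unravel_index(np.argmax(saliency), saliency.shape)
W_pruned = W.copy()
W_pruned[i, j] = 0.0

predicted = -W[i, j] * grad[i, j]            # first-order term for the perturbation -W_ij
actual = loss(W_pruned) - loss(W)
second_order = 0.5 * (W[i, j] * x[j]) ** 2   # exact remainder for this quadratic loss
print(predicted, actual, predicted + second_order)
```

Because this toy loss is quadratic, the actual change equals the first-order prediction plus the second-order remainder exactly, which is the structure of Eq. 2 in miniature.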

LRP as a richer influence proxy. AttnLRP with $\epsilon = 10^{-6}$ propagates relevances backward through the network from the output. For each calibration sample $d$ , the per-weight relevance $\tilde{I}(d) \in \mathbb{R}^{m \times n}$ captures the functional contribution of each weight element. The aggregate influence matrix is:

I = \sum_{d \in \mathcal{D}_{cal}} |\tilde{I}(d)|

normalized to unit mean per layer (so $\delta$ is dimensionless and consistent across layers of different scales).

Critical property — element-wise resolution. $I \in \mathbb{R}^{m \times n}$ retains the same shape as $W$ , with per-element granularity. This is what FWSVD lacks: FWSVD’s row-wise Fisher sums out the column index, producing a vector $\in \mathbb{R}^m$ that carries no information about which specific input channels are coupled to high-influence outputs. AIR uses the full matrix $I$ , preserving this structure.
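The aggregation step and the information lost by row-wise pooling can be illustrated with a deliberately tiny example. The two relevance stacks below are hypothetical values constructed so that their row sums agree while their element-wise patterns do not; `aggregate_influence` is a name made up for this sketch.

```python
import numpy as np


def aggregate_influence(per_sample_relevance):
    """Sum |I_tilde(d)| over samples (axis 0), then rescale to unit mean."""
    I = np.abs(per_sample_relevance).sum(axis=0)
    return I / I.mean()


# Hypothetical per-sample relevance stacks with shape (samples, m, n).
diag = np.array([[[4.0, 0.0], [0.0, 4.0]],
                 [[-2.0, 0.0], [0.0, 2.0]]])
anti = np.array([[[0.0, 4.0], [4.0, 0.0]],
                 [[0.0, -2.0], [2.0, 0.0]]])

I_diag = aggregate_influence(diag)
I_anti = aggregate_influence(anti)

# Row-wise pooling (what a per-row score sees) cannot tell the two apart.
row_diag = I_diag.sum(axis=1)
row_anti = I_anti.sum(axis=1)
print(I_diag, I_anti, row_diag, row_anti, sep="\n")
```

The two influence matrices put their weight on different (output, input) pairs, yet the pooled row vectors are identical, which is the structural information a row-wise score discards.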

Ablation finding. The paper demonstrates that LRP-ε, W×G, and per-element Fisher yield essentially identical perplexity after ALS integration. This suggests that the ranking of element-wise importances, rather than their precise values, is what matters, and that all three signals largely agree on this ranking. The improvement over FWSVD is therefore not attributable to LRP being a better signal than Fisher; it is attributable to element-wise rather than row-wise granularity.

3.3 The AIR Objective

With both $W'$ (activation-whitened) and $I$ (element-wise influence) computed, AIR defines the hybrid objective:

\mathcal{L}_{act,infl} = \left\| \sqrt{\mathbf{1} + \delta I} \odot \left(W' - U'_k \Sigma'_k V'^{\top}_k\right) \right\|_F^2 \tag{Eq. 4}

where $\mathbf{1} \in \mathbb{R}^{m \times n}$ is the all-ones matrix, $\delta \geq 0$ is a scalar, $\odot$ denotes the Hadamard (element-wise) product, and $\sqrt{\cdot}$ is element-wise square root.

Expanding element-wise:

\mathcal{L}_{act,infl} = \sum_{i=1}^{m} \sum_{j=1}^{n} (1 + \delta I_{ij}) \left(W'_{ij} - (U'_k \Sigma'_k V'^{\top}_k)_{ij}\right)^2

This is a weighted least-squares objective: each element’s squared error is scaled by $(1 + \delta I_{ij})$ . Elements with high influence ( $I_{ij}$ large) receive heavier penalization for approximation error.

Why the additive all-ones anchor is essential. If the weight were simply $\delta I$ (without $\mathbf{1}$ ), elements with $I_{ij} = 0$ would contribute zero to the objective — the approximation could freely degrade them. But zero influence does not mean zero cost: errors in zero-influence elements still produce incorrect activations in the whitened space. The additive $\mathbf{1}$ ensures the objective never drops below the activation-aware floor. Formally:

\mathcal{L}_{act,infl}\big|_{\delta=0} = \left\| \mathbf{1} \odot \left(W' - U'_k \Sigma'_k V'^{\top}_k\right) \right\|_F^2 = \| W' - U'_k \Sigma'_k V'^{\top}_k \|_F^2 = \mathcal{L}_{act}

So at $\delta = 0$ , AIR recovers SVD-LLM(W) exactly.
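Both properties, the role of the anchor and the $\delta = 0$ reduction, can be verified directly. The sketch below evaluates the weighted objective on random data; the function name and the `anchor` flag are hypothetical and exist only to contrast $(1 + \delta I)$ with the anchor-free weighting $\delta I$ . The influence matrix is an extreme made-up case with all influence on one element.

```python
import numpy as np


def weighted_objective(W_white, W_approx, I, delta, anchor=True):
    """Sum of weighted squared errors with weights (1 + delta*I), or delta*I without the anchor."""
    weights = delta * I + (1.0 if anchor else 0.0)
    return float(np.sum(weights * (W_white - W_approx) ** 2))


rng = np.random.default_rng(0)
W_white = rng.standard_normal((5, 4))
W_approx = W_white + 0.1 * rng.standard_normal((5, 4))

# Extreme influence pattern: everything on element (0, 0); mean is 20 / 20 = 1.
I = np.zeros((5, 4))
I[0, 0] = 20.0

frobenius_sq = np.linalg.norm(W_white - W_approx, "fro") ** 2
with_anchor = weighted_objective(W_white, W_approx, I, delta=1.0)
without_anchor = weighted_objective(W_white, W_approx, I, delta=1.0, anchor=False)
at_zero = weighted_objective(W_white, W_approx, I, delta=0.0)
print(frobenius_sq, at_zero, with_anchor, without_anchor)
```

Without the anchor, only the error at element (0, 0) is counted and the other 19 elements are free to drift; with the anchor, the objective is never below the plain Frobenius loss.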

Design choice — additive vs. multiplicative anchor. Alternatives include $\delta I$ (no anchor), $e^{\delta I}$ (exponential), or $\max(1, \delta I)$ (hard floor at 1). The paper’s choice of $(1 + \delta I)$ is motivated by the additive structure: low-influence elements always retain unit weight from $\mathbf{1}$ , while high-influence elements are upweighted by $\delta I_{ij}$ . This is a linear interpolation between the activation-aware objective (all weights = 1) and a fully influence-weighted objective (weights $\propto I$ ).

Hyperparameter $\delta$ . This controls the relative emphasis of influence-guided vs. activation-guided refinement. Larger $\delta$ pushes the optimizer toward preserving high-influence weight elements at the possible cost of high-activation-but-low-influence elements. In practice, $\delta$ is tuned on a held-out validation perplexity. The paper demonstrates that AIR provides consistent gains across a range of $\delta$ values.

3.4 ALS Solution: Closed-Form Rank-wise Updates

The weighted Frobenius objective is not minimized by standard SVD (which requires uniform weights). AIR solves it via alternating least squares, processing each rank component $r$ from $k-1$ down to $0$ in a single sweep.

Initialization from SVD-LLM(W):

U'_k,\ \Sigma'_k,\ V'_k \leftarrow \text{SVD}(W',\ k)

This is the globally optimal solution to $\mathcal{L}_{act}$ , providing the strongest possible warm start.

Why sweep backward (from $r = k-1$ to $r = 0$ )? The ALS loop updates components from the least significant (smallest $\sigma'_r$ ) to the most significant (largest $\sigma'_0$ ). Minor components are corrected first, and when the major components are updated, all minor corrections are already incorporated. Because each update is an exact minimizer given the current state of every other component, every step is a descent step, and with this ordering each update to rank $r$ takes into account the most recently improved versions of all rank components $> r$ .

Rank- $r$ residual: To isolate the $r$ -th component for optimization, we temporarily remove it from the current rank- $k$ approximation $W'_k = \sum_{s=0}^{k-1} \sigma'_s u'_s v'^{\top}_s$ :

E_r = W' - W'_k + \sigma'_r u'_r v'^{\top}_r \in \mathbb{R}^{m \times n}

By construction, $E_r$ is exactly the $r$ -th component $\sigma'_r u'_r v'^{\top}_r$ plus the residual $W' - W'_k$ that the full rank- $k$ approximation cannot explain.

Update for $v'_r$ (Eq. 5) — derivation. Define $C = \mathbf{1} + \delta I \in \mathbb{R}^{m \times n}$ (element-wise, $C_{ij} \geq 1$ ). Fix $u'_r$ (normalized, $\|u'_r\|_2 = 1$ ) and $\sigma'_r > 0$ , and minimize over $v'_r$ :

\min_{v'_r} \sum_{i,j} C_{ij} \left(E_{r,ij} - \sigma'_r (u'_r)_i (v'_r)_j\right)^2

Taking the derivative with respect to $(v'_r)_j$ and setting to zero:

\frac{\partial}{\partial (v'_r)_j} \sum_{i} C_{ij} \left(E_{r,ij} - \sigma'_r (u'_r)_i (v'_r)_j\right)^2 = 0

-2 \sigma'_r \sum_{i} C_{ij} (u'_r)_i \left(E_{r,ij} - \sigma'_r (u'_r)_i (v'_r)_j\right) = 0

\sigma'_r (v'_r)_j \sum_{i} C_{ij} (u'_r)_i^2 = \sum_{i} C_{ij} (u'_r)_i E_{r,ij}

(v'_r)_j = \frac{\sum_i C_{ij} (u'_r)_i E_{r,ij}}{\sigma'_r \sum_i C_{ij} (u'_r)_i^2}

In compact matrix notation, with $(u'^{\odot 2}_r)$ denoting elementwise square of $u'_r$ :

v'^{\top}_r = \frac{u'^{\top}_r \left[(1 + \delta I) \odot E_r\right]}{\sigma'_r \cdot (u'^{\odot 2}_r)^{\top} (1 + \delta I)} \tag{Eq. 5}

The numerator $u'^{\top}_r [(C \odot E_r)]$ is a row vector in $\mathbb{R}^{1 \times n}$ ; the denominator $(u'^{\odot 2}_r)^{\top} C$ is also in $\mathbb{R}^{1 \times n}$ ; dividing element-wise and transposing gives $v'_r \in \mathbb{R}^{n \times 1}$ .

Update for $\tilde{u}'_r$ (Eq. 6) — derivation. Now fix the updated $v'_r$ and minimize over $u'_r$ (without normalization constraint, yielding $\tilde{u}'_r$ ):

\min_{\tilde{u}'_r} \sum_{i,j} C_{ij} \left(E_{r,ij} - (\tilde{u}'_r)_i (v'_r)_j\right)^2

By symmetry of the derivation, differentiating over $(\tilde{u}'_r)_i$ and solving:

(\tilde{u}'_r)_i = \frac{\sum_j C_{ij} (v'_r)_j E_{r,ij}}{\sum_j C_{ij} (v'_r)_j^2}

In matrix notation:

\tilde{u}'_r = \frac{\left[(1 + \delta I) \odot E_r\right] v'_r}{(1 + \delta I) v'^{\odot 2}_r} \tag{Eq. 6}

where the numerator is a vector in $\mathbb{R}^m$ (matrix-vector product with element-wise pre-weighting) and the denominator is also in $\mathbb{R}^m$ (denominator for each row $i$ ).

Extracting the singular value and normalizing:

\sigma'_r \leftarrow \|\tilde{u}'_r\|_2, \qquad u'_r \leftarrow \tilde{u}'_r / \sigma'_r
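Equations 5 and 6 and the normalization step can be put together into a short NumPy sketch of the backward sweep. This is a reading of the formulas above, not the authors' implementation: the function name is hypothetical, the residual is recomputed densely at every step for clarity rather than speed, and the inputs are random toy data.

```python
import numpy as np


def air_als_sweep(W_white, I, k, delta):
    """One backward ALS sweep on sum((1 + delta*I) * (W' - U diag(s) V^T)**2)."""
    C = 1.0 + delta * I
    U, s, Vt = np.linalg.svd(W_white, full_matrices=False)
    U, s, V = U[:, :k].copy(), s[:k].copy(), Vt[:k, :].T.copy()   # warm start

    def objective():
        return float(np.sum(C * (W_white - (U * s) @ V.T) ** 2))

    history = [objective()]
    for r in range(k - 1, -1, -1):
        # Rank-r residual: current residual with the r-th component added back.
        E = W_white - (U * s) @ V.T + s[r] * np.outer(U[:, r], V[:, r])
        CE = C * E
        # Eq. 5: weighted least squares for v_r with u_r and sigma_r fixed.
        V[:, r] = (U[:, r] @ CE) / (s[r] * ((U[:, r] ** 2) @ C))
        history.append(objective())
        # Eq. 6: weighted least squares for the unnormalized u_r with v_r fixed.
        u_tilde = (CE @ V[:, r]) / (C @ (V[:, r] ** 2))
        s[r] = np.linalg.norm(u_tilde)
        U[:, r] = u_tilde / s[r]
        history.append(objective())
    return U, s, V, history


rng = np.random.default_rng(1)
W_white = rng.standard_normal((12, 9))
I = np.abs(rng.standard_normal((12, 9)))
I /= I.mean()                                  # unit-mean influence

_, _, _, hist = air_als_sweep(W_white, I, k=4, delta=2.0)
_, _, _, hist_zero = air_als_sweep(W_white, I, k=4, delta=0.0)
print(hist[0], hist[-1], hist_zero[0], hist_zero[-1])
```

The recorded objective values should never increase along the sweep, and with `delta=0.0` the sweep should leave the SVD warm start unchanged up to rounding, since the truncated SVD is already a fixed point of the unweighted updates.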

Figure 3 — ALS Iteration Flowchart

flowchart TD
    INIT["Initialize from SVD-LLM(W):\nU'_k, Sigma'_k, V'_k = SVD(W', k)\n(globally optimal for delta=0)"]
    SET["Set r = k-1\n(start from least significant rank)"]
    INIT --> SET

    RES["Compute rank-r residual:\nE_r = W' - W'_k + sigma'_r u'_r v'^T_r\n(add back rank-r component)"]
    SET --> RES

    UPVR["Update v'_r via Eq.5\n(weighted least squares, fixed u'_r)\nv'^T_r = numerator / denominator\n(closed-form)"]
    RES --> UPVR

    UPUR["Update u_tilde'_r via Eq.6\n(weighted least squares, fixed v'_r)\nu_tilde'_r = numerator / denominator\n(closed-form)"]
    UPVR --> UPUR

    EXTR["Extract sigma and normalize:\nsigma'_r = ||u_tilde'_r||_2\nu'_r = u_tilde'_r / sigma'_r"]
    UPUR --> EXTR

    DEC["r = r - 1"]
    EXTR --> DEC

    CHK{"r >= 0?"}
    DEC --> CHK

    CHK -- "Yes (continue backward sweep)" --> RES
    CHK -- "No (sweep complete)" --> DONE["Final influenced factors:\nU'_k (columns updated), Sigma'_k, V'_k (rows updated)\nL_act,infl is non-increasing throughout"]

Algorithm 1 — AIR Weight Compression

Input:  W ∈ R^{m×n}   (original weight matrix for one linear layer)
        k              (target rank, k < min(m,n))
        D_cal          (calibration data, ~128 samples)
        δ ≥ 0          (influence weight hyperparameter)
Output: U_k ∈ R^{m×k}, V^T_k ∈ R^{k×n}  such that W_k = U_k V^T_k

--- Forward analysis ---
Step  1: Run full model forward on D_cal; collect hidden states X at this layer's input
Step  2: Sigma ← Σ_{d ∈ D_cal} X_d X_d^T       (activation covariance, n×n)
Step  3: S ← cholesky(Sigma + ε I)               (profiling matrix, lower-triangular)
Step  4: W' ← W · S                               (activation-whitened weight, Eq.1)

--- Backward analysis ---
Step  5: Initialize relevance R = f(x) at model output for each sample d in D_cal
Step  6: Propagate backward via AttnLRP (ε=1e-6) through all layers to this layer
Step  7: Extract per-weight relevance: I_tilde(d) ∈ R^{m×n}  for each sample
Step  8: I ← Σ_{d ∈ D_cal} |I_tilde(d)|          (aggregate element-wise influence)
Step  9: I ← I / mean(I)                           (normalize to unit mean per layer)

--- ALS warm-start from SVD-LLM(W) ---
Step 10: U'_k, Sigma'_k, V'_k ← SVD(W', k)        (globally optimal for δ=0)

--- Single backward ALS sweep ---
Step 11: for r = k-1 downto 0:
Step 12:   E_r ← W' - W'_k + σ'_r u'_r v'^T_r    (rank-r residual)
Step 13:   v'_r ← Eq.5(u'_r, σ'_r, E_r, I, δ)    (closed-form WLS update for v)
Step 14:   ũ'_r ← Eq.6(v'_r, E_r, I, δ)           (closed-form WLS update for u)
Step 15:   σ'_r ← ||ũ'_r||_2
Step 16:   u'_r ← ũ'_r / σ'_r

--- Back-project to native space ---
Step 17: U_k ← U'_k  √Sigma'_k                     (absorb sqrt(Sigma') into U)
Step 18: V^T_k ← √Sigma'_k  (V'^T_k  S^{-1})       (undo whitening: multiply by S^{-1})
Step 19: return U_k, V^T_k

Line-by-line explanation:

Steps 1–4 (Forward analysis): A standard forward pass on calibration data captures hidden states $X$ at each layer’s input. The covariance $\Sigma$ encodes which input directions are active. Adding $\epsilon I$ regularizes near-singular covariances. The Cholesky factor $S$ is the “square root” of the covariance, so $W' = WS$ lives in a space where all active input directions are isotropically represented. SVD on $W'$ then allocates approximation capacity to the directions that matter for actual input distributions.

Steps 5–9 (Backward analysis): AttnLRP is initialized at the network’s output relevance (the predicted logit or next-token probability) and propagated backward with ε-LRP rules adapted for transformer attention. The $\epsilon = 10^{-6}$ stabilizer prevents division-by-zero in the LRP denominator. For each sample $d$ , $\tilde{I}(d)_{ij}$ quantifies how much weight element $(i,j)$ contributed to the output relevance for that sample. Summing absolute values over samples and normalizing to unit mean makes $\delta$ a dimensionless weighting hyperparameter consistent across all layers.

Step 10 (ALS warm start): By Eckart-Young, $\text{SVD}(W', k)$ is the globally optimal solution to $\mathcal{L}_{act}$ (the $\delta = 0$ case of AIR). Starting here means the ALS sweep begins at the activation-aware optimum and can only improve in $\mathcal{L}_{act,infl}$ .

Steps 11–16 (ALS backward sweep): For each rank index from $k-1$ down to $0$ , the residual $E_r$ isolates the $r$ -th rank component. Equation 5 gives the globally optimal $v'_r$ for fixed $u'_r$ and the current residual — a closed-form weighted least squares expression with no iterative steps required. Equation 6 similarly gives the optimal unnormalized $\tilde{u}'_r$ for the updated $v'_r$ . Both updates are globally optimal in their subspace, guaranteeing monotone descent in $\mathcal{L}_{act,infl}$ . The backward direction (from $r = k-1$ toward $r = 0$ ) ensures that when we update the major components (small $r$ ), all minor components are already in their corrected state.

Steps 17–18 (Back-projection): The ALS factors $U'_k, \Sigma'_k, V'_k$ live in the whitened space and approximately reconstruct $W' = WS$ . Splitting $\sqrt{\Sigma'_k}$ between the two factors and right-multiplying $V'^{\top}_k$ by $S^{-1}$ undoes the whitening: $U_k V_k^\top \approx W$ . This is an exact algebraic inversion with no approximation beyond the rank- $k$ truncation.

Step 19 (Return): At inference, $y \approx W_k x = U_k (V_k^\top x)$ , computed as two sequential matrix-vector products with total cost $k(m+n)$ MACs.

3.5 Monotone Descent Guarantee (Proposition 3.1)

Formal statement: Starting from $U'_k, \Sigma'_k, V'_k = \text{SVD}(W', k)$ and applying Equations 5–6 for $r = k-1, k-2, \ldots, 0$ , the objective $\mathcal{L}_{act,infl}$ is non-increasing at every update step.

Proof sketch: Consider the update of $v'_r$ (Eq. 5). Fixing all other variables, the objective $\mathcal{L}_{act,infl}$ is a convex quadratic in $v'_r$ (weighted Frobenius norm in a rank-1 component). Equation 5 solves this to global optimality. Since the global optimum of a convex subproblem cannot be worse than the current point, $\mathcal{L}_{act,infl}$ does not increase after the $v'_r$ update. The same argument applies to the $\tilde{u}'_r$ update (Eq. 6). Two non-increasing steps per rank means the full sweep is non-increasing at every sub-step. The backward direction (from minor to major components) ensures there are no circular dependencies between rank updates.

Consequence: The AIR output satisfies

\mathcal{L}_{act,infl}(\text{AIR output}) \leq \mathcal{L}_{act,infl}(\text{SVD-LLM(W) initialization})

Since SVD-LLM(W) is itself optimal for $\mathcal{L}_{act} = \mathcal{L}_{act,infl}\big|_{\delta=0}$ , AIR is guaranteed to be no worse in the hybrid objective. Empirically, lower $\mathcal{L}_{act,infl}$ reliably translates to lower perplexity.

Important caveat: The guarantee applies to the proxy objective $\mathcal{L}_{act,infl}$ , not to downstream perplexity or task accuracy. If $\mathcal{L}_{act,infl}$ is poorly aligned with the model’s actual loss landscape (e.g., due to pathological calibration data or very aggressive compression), the proxy improvement might not translate to perplexity improvement. In all experiments reported in the paper, this alignment holds.

3.6 Back-Projection and Inference Acceleration

After the ALS sweep, the compressed factors are in the whitened activation space. The back-projection step transforms them back to the native parameter space:

U_k = U'_k \sqrt{\Sigma'_k}, \qquad V^\top_k = \sqrt{\Sigma'_k} \cdot \left(V'^{\top}_k S^{-1}\right)

Correctness check: $U_k V_k^\top = U'_k \sqrt{\Sigma'_k} \cdot \sqrt{\Sigma'_k} V'^{\top}_k S^{-1} = U'_k \Sigma'_k V'^{\top}_k S^{-1} \approx W' S^{-1} = WS \cdot S^{-1} = W$ . The approximation comes only from the rank- $k$ truncation.

The computation cost of back-projection is dominated by the matrix multiply $V'^{\top}_k S^{-1}$ , which costs $O(k \cdot n^2)$ . For $k \ll n$ , this is much cheaper than the $O(n^3)$ Cholesky factorization and the SVD of $W'$ that precede it. The product is obtained by a triangular solve against the lower-triangular Cholesky factor $S$ , without forming $S^{-1}$ explicitly.
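A minimal sketch of the back-projection, assuming the whitened factors come from a plain SVD of $W' = WS$. The helper name `back_project` is hypothetical, and `np.linalg.solve` is used for brevity; a dedicated triangular solver such as `scipy.linalg.solve_triangular` would exploit the structure of $S$.

```python
import numpy as np

def back_project(U_w, s_w, Vt_w, S):
    """Map whitened factors (U', Sigma', V'^T) to native factors (U_k, V_k^T)."""
    root = np.sqrt(s_w)
    U_k = U_w * root                                   # U' sqrt(Sigma')
    # Solve X S = V'^T for X, i.e. S^T X^T = V', without forming S^{-1}.
    Vt_unwhitened = np.linalg.solve(S.T, Vt_w.T).T     # V'^T S^{-1}
    Vt_k = root[:, None] * Vt_unwhitened               # sqrt(Sigma') V'^T S^{-1}
    return U_k, Vt_k

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 5))
X = rng.standard_normal((5, 30))
S = np.linalg.cholesky(X @ X.T + 1e-6 * np.eye(5))
U_w, s_w, Vt_w = np.linalg.svd(W @ S, full_matrices=False)   # full rank: k = 4
U_k, Vt_k = back_project(U_w, s_w, Vt_w, S)
W_rebuilt = U_k @ Vt_k
```

With no truncation (here $k = 4$, the full rank of the toy matrix) `W_rebuilt` recovers `W` up to floating-point error, matching the correctness check above.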

Inference model after compression:

y = W_k x = U_k (V^\top_k x)
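The two-step product and its cost can be illustrated as follows. The square 4096 × 4096 shape and the helper names are assumptions chosen for the example; the retention-to-rank rule simply inverts the parameter count $k(m+n)$.

```python
import numpy as np

def rank_for_retention(m, n, retention):
    """Largest k such that k * (m + n) <= retention * m * n."""
    return int(retention * m * n // (m + n))

def compressed_forward(U_k, Vt_k, x):
    h = Vt_k @ x       # k * n multiply-accumulates
    return U_k @ h     # k * m multiply-accumulates

m = n = 4096
k = rank_for_retention(m, n, 0.6)
dense_macs = m * n
lowrank_macs = k * (m + n)
print(k, round(lowrank_macs / dense_macs, 3))   # 1228 0.6

rng = np.random.default_rng(0)
U_k = rng.standard_normal((8, 3))
Vt_k = rng.standard_normal((3, 6))
x = rng.standard_normal(6)
y = compressed_forward(U_k, Vt_k, x)
```

For a square layer, 60 pct retention therefore corresponds to $k \approx 0.3n$, and the factored product never materializes the $m \times n$ matrix $U_k V_k^\top$.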

Figure 4 — Inference Pipeline: Before and After AIR Compression

flowchart LR
    subgraph BEFORE["Original Linear Layer"]
        direction TB
        XI["Input x in R^n"] 
        WO["Weight W in R^(m x n)\nMACs = m * n"]
        YO["Output y in R^m"]
        XI --> WO --> YO
    end

    subgraph AFTER["AIR-Compressed Layer (rank k)"]
        direction TB
        XC["Input x in R^n"]
        VT["Step 1: V^T_k x\nV^T_k in R^(k x n)  MACs = k*n"]
        HID["Hidden h in R^k"]
        UK["Step 2: U_k h\nU_k in R^(m x k)  MACs = k*m"]
        YC["Output y_approx in R^m"]
        XC --> VT --> HID --> UK --> YC
    end

    BEFORE -- "AIR compression:\nretain rank k" --> AFTER

    SAVINGS["Parameter savings:\nmn to k(m+n)\nFLOP savings: same ratio\nat 60pct retention (k=0.3n):\n~40pct fewer MACs\n~64pct less peak memory\n~53pct lower latency (A100)"]
    AFTER --> SAVINGS

The memory reduction (64 pct) is larger than the parameter and FLOP reduction (40 pct). Storing the two factors ( $U_k, V_k^\top$ ) instead of $W$ accounts for the parameter savings, and the reduced intermediate activation size ( $k < n$ ) additionally shrinks the KV cache and activation buffers. The 53 pct latency improvement at 60 pct retention exceeds the FLOP prediction because large transformer matrix-vector products are memory-bandwidth-bound, not compute-bound, during autoregressive decoding. Smaller matrices load faster from HBM to compute units, improving effective hardware utilization.

Experiments

4.1 Methods Without Enhancements

The primary benchmark is LLaMA-7B compressed with 128 calibration samples from WikiText-2 training data. Evaluation covers WikiText-2 perplexity (PPL), C4 PPL (out-of-distribution), and an aggregate of four reasoning benchmarks (ARC-challenge, ARC-easy, WinoGrande, HellaSwag).

Figure 5 — Perplexity Gains of AIR over SVD-LLM(W) Across Retention Rates

flowchart LR
    subgraph HIGH["High-Retention Regime (80-60pct params)"]
        direction TB
        H80["80pct params\nWikiText-2 PPL:\nSVD-LLM 7.87 vs AIR 7.51\nGain: 4.6pct"]
        H60["60pct params\nWikiText-2 PPL:\nSVD-LLM 13.81 vs AIR 11.27\nGain: 18.4pct"]
    end

    subgraph LOW["Aggressive-Compression Regime (40-20pct params)"]
        direction TB
        L40["40pct params\nWikiText-2 PPL:\nSVD-LLM 63.83 vs AIR 42.52\nGain: 33.4pct"]
        L20["20pct params\nWikiText-2 PPL:\nSVD-LLM 854 vs AIR 472\nGain: 44.7pct"]
    end

    INSIGHT["Key trend:\nGains compound with compression\naggressiveness. Influence-aware\nelement selection is most\ncritical under tight budgets."]

    HIGH --> INSIGHT
    LOW --> INSIGHT

Full results table (LLaMA-7B, WikiText-2 PPL, lower is better):

Param Rate	Vanilla SVD	ASVD	SVD-LLM(W)	AIR	PPL change, AIR vs SVD-LLM(W)
100 pct	5.68	5.68	5.68	5.68	—
80 pct	19438	116	7.87	7.51	−4.6 pct
60 pct	52839	—	13.81	11.27	−18.4 pct
40 pct	—	—	63.83	42.52	−33.4 pct
20 pct	—	—	854	472	−44.7 pct
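The last column follows directly from the SVD-LLM(W) and AIR columns. A short check of the arithmetic, using only the numbers in the table:

```python
# (SVD-LLM(W) PPL, AIR PPL) per retention rate, copied from the table above.
rows = {80: (7.87, 7.51), 60: (13.81, 11.27), 40: (63.83, 42.52), 20: (854.0, 472.0)}

def relative_gain_pct(baseline, improved):
    """Relative perplexity reduction in percent."""
    return 100.0 * (baseline - improved) / baseline

gains = {rate: round(relative_gain_pct(b, a), 1) for rate, (b, a) in rows.items()}
print(gains)   # {80: 4.6, 60: 18.4, 40: 33.4, 20: 44.7}
```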

C4 gains are larger. C4 PPL improvements range from 14.5 pct (at 80 pct retention) to 70.4 pct (at 20 pct). Since the calibration data is WikiText-2 (not C4), these gains demonstrate that AIR’s influence-aware compression generalizes across domains. The backward signal captures structural importance that is not domain-specific: weight elements that influence predictions on one distribution tend to influence predictions on others.

FWSVD diagnostic. FWSVD at 80 pct retention achieves WikiText-2 PPL of 22026 — worse than even vanilla SVD (19438). This is not a marginal failure; it is catastrophic. The backward signal in FWSVD actively hurts because row-wise Fisher aggregation provides misleading saliency without activation awareness to compensate. This strongly validates AIR’s design choice: element-wise spatial structure is non-negotiable.

Reasoning benchmarks. Across ARC-challenge, ARC-easy, WinoGrande, and HellaSwag, AIR consistently outperforms SVD-LLM(W). The aggregate gain at 60 pct retention is +1.6 percentage points (41.6 vs 40.0). These benchmarks measure real model capability (not just perplexity), confirming that influence-aware compression preserves functional knowledge better than pure activation-aware compression.

4.2 AIR + LoRA vs ACIP

Low-Rank Adaptation (LoRA) adds trainable rank- $r$ perturbations $\Delta W = BA$ (with $B \in \mathbb{R}^{m \times r}$ , $A \in \mathbb{R}^{r \times n}$ , $r \ll k$ ) to the frozen compressed weights. After AIR compresses a layer to $W_k = U_k V_k^\top$ , LoRA trains small correction matrices on a calibration dataset.

AIR+LoRA outperforms ACIP at every retention rate tested. This matters because ACIP is itself a full end-to-end optimization method with access to gradient-based joint optimization across all layers — significantly more compute than AIR’s layer-local procedure. The fact that a locally-optimized compression (AIR) combined with small LoRA adapters beats a globally-optimized method (ACIP) indicates that AIR provides a qualitatively better low-rank subspace, not just a marginally better one.

The composability is by design. AIR does not impose any constraint on the compressed factors that would interfere with subsequent fine-tuning. The two-factor representation $U_k, V_k^\top$ is directly compatible with standard LoRA implementations, which also decompose weight updates into low-rank factors.
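How a LoRA correction sits next to the frozen AIR factors can be sketched as below. The zero initialization of $B$ follows the usual LoRA convention; the function name and toy shapes are assumptions for illustration, and no training loop is shown.

```python
import numpy as np

def air_lora_forward(U_k, Vt_k, B, A, x):
    """y = U_k (V_k^T x) + B (A x): frozen rank-k path plus trainable rank-r path."""
    return U_k @ (Vt_k @ x) + B @ (A @ x)

rng = np.random.default_rng(0)
m, n, k, r = 6, 5, 3, 1
U_k = rng.standard_normal((m, k))       # frozen compressed factors
Vt_k = rng.standard_normal((k, n))
A = 0.01 * rng.standard_normal((r, n))  # trainable, small random init
B = np.zeros((m, r))                    # trainable, zero init
x = rng.standard_normal(n)

y0 = air_lora_forward(U_k, Vt_k, B, A, x)
extra_params = r * (m + n)              # parameters added by the adapter
```

With $B = 0$ the adapter contributes nothing, so fine-tuning starts exactly at the AIR-compressed layer and adds only $r(m+n)$ trainable parameters.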

Practical implication: For any deployment pipeline that includes fine-tuning (instruction following, domain adaptation, alignment), AIR+LoRA is the recommended choice. For inference-only deployment (no fine-tuning budget), AIR alone provides the best local compression quality.

4.3 Cross-Family Generalization

The paper reports consistent AIR improvements over SVD-LLM(W) across multiple LLaMA family sizes beyond the primary LLaMA-7B results. While the specific cross-family numbers are not detailed in the paper abstract, the consistent gains support the conclusion that AIR’s mechanism is not architecture- or scale-specific.

The underlying reason for generalization is that the key properties exploited by AIR — activation covariance structure and element-wise weight-output influence — are fundamental to the transformer architecture, not artifacts of any particular model size or configuration. Any transformer with attention and feedforward layers will exhibit the same basic patterns.

4.4 Calibration Data Efficiency

A practical concern for SVD-based compression is calibration data cost. Collecting hidden states for covariance estimation and running backward passes for influence scores both require multiple passes through a large model. Standard calibration uses 128 samples; the question is how much this can be reduced.

AIR achieves SVD-LLM(W)-equivalent quality with approximately 10× fewer calibration samples (≈13 samples). This result has significant practical implications:

Low-data scenarios: Organizations compressing proprietary fine-tuned models may have limited calibration data that represents the target distribution.
Compute constraints: Running 13 instead of 128 backward passes through a 7B model cuts the backward-pass calibration cost by roughly 10×.
Rapid iteration: With 10× fewer samples, compression experiments run faster, enabling more thorough hyperparameter search.

The efficiency gain arises because the backward signal provides richer information per sample than the covariance alone. A single backward pass captures first-order sensitivity of the loss, which forward-only covariance (second-order) statistics would need many more samples to approximate. The combination of forward (covariance, second-order) and backward (influence, first-order) signals is more informative per sample than either alone.

System-Level Efficiency

The system efficiency measurements at 60 pct parameter retention on a single NVIDIA A100 40GB GPU translate the perplexity improvements into deployment-relevant hardware numbers.

Figure 6 — System Efficiency Gains: AIR at 60 pct Retention vs Baseline (A100 40GB)

flowchart TD
    subgraph BASE["Baseline: LLaMA-7B at 100pct Retention"]
        direction LR
        BM["Peak GPU Memory\n~13.5 GB (FP16)"]
        BL["Per-token Latency\nbaseline = 1.0x"]
        BC["Calibration Cost\n0 min"]
    end

    subgraph AIR60["AIR: LLaMA-7B at 60pct Retention"]
        direction LR
        AM["Peak GPU Memory\n~4.9 GB\n64pct REDUCTION"]
        AL["Per-token Latency\n0.47x of baseline\n53pct REDUCTION"]
        AC["Calibration Cost\n~12 min (one-time)\namortized over all inference"]
    end

    BASE --> AIR60

    subgraph DEPLOY["Deployment Impact"]
        direction TB
        GPU["Fits A100 40GB  -->  Fits T4 16GB\nor 2.7x more models per GPU"]
        THRU["2.1x throughput improvement\ncritical for serving SLAs"]
        CAL["12 min per model, not per query\nnegligible for production deploys"]
    end

    AIR60 --> DEPLOY

Peak memory: 64 pct reduction. The 64 pct reduction is larger than the 40 pct parameter reduction would naively suggest. This is because in addition to weight storage savings, the smaller intermediate activation dimensions (key/value cache, feedforward hidden states) reduce activation memory. For LLaMA-7B at FP16, the full model requires ~13.5 GB; at 60 pct retention it drops to ~4.9 GB. This enables deployment on smaller GPUs (e.g., NVIDIA T4 16GB or RTX 4090 24GB) without quantization, or allows fitting significantly more model replicas per GPU server.

Per-token latency: 53 pct reduction. This translates to roughly 2.1× throughput improvement. For serving applications constrained by latency SLAs, this is a compelling gain. The 53 pct improvement exceeds the naive FLOP-count prediction of ~40 pct, because autoregressive decoding is primarily memory-bandwidth-bound rather than compute-bound. Smaller weight matrices load faster from HBM, and the reduced hidden dimension shrinks the KV cache accessed per attention head, yielding better effective bandwidth utilization.

Calibration time: ~12 minutes. AIR adds a backward influence pass and one ALS sweep on top of SVD-LLM(W)’s forward covariance pass. For LLaMA-7B, the total calibration (forward pass, backward LRP pass, ALS sweep) takes approximately 12 minutes on an A100. This is a one-time cost: compress once, serve indefinitely. For any deployment with more than ~100K tokens served, the calibration cost is negligible per token.
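The derived figures quoted above (throughput gain, memory reduction, replicas per GPU) follow from the reported 53 pct latency cut and the 13.5 GB and 4.9 GB footprints. A quick check of that arithmetic:

```python
latency_reduction = 0.53
throughput_gain = 1.0 / (1.0 - latency_reduction)   # tokens per second ratio

mem_before_gb, mem_after_gb = 13.5, 4.9
memory_reduction_pct = 100.0 * (1.0 - mem_after_gb / mem_before_gb)
replicas_ratio = mem_before_gb / mem_after_gb        # models per fixed memory budget

print(round(throughput_gain, 2), round(memory_reduction_pct, 1), round(replicas_ratio, 2))
# 2.13 63.7 2.76
```

The throughput figure assumes latency per token is the only bottleneck, which is a simplification for a real serving stack.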

Comparison with alternative compression methods on calibration cost:

Method	Calibration Requirement	Approximate Time (7B)
Vanilla SVD	None — pure linear algebra	Seconds
ASVD	Forward pass only	1–2 min
SVD-LLM(W)	Forward pass + covariance	3–5 min
FWSVD	Forward + backward (row-wise Fisher)	5–8 min
AIR	Forward + backward + ALS sweep	~12 min
ACIP	End-to-end gradient optimization	Hours

AIR sits between the cheap local methods and ACIP’s full optimization in calibration cost. It delivers most of ACIP’s quality gain on its own, and exceeds it when combined with LoRA, at a fraction of the calibration cost.

Ablations: Which Influence Signal?

The most surprising and practically important ablation in the paper tests whether the specific backward signal used to compute $I$ matters. Three signals are evaluated:

LRP-ε (AttnLRP): Transformer-adapted layer-wise relevance propagation initialized at the output relevance and propagated backward through attention and feedforward layers with $\epsilon = 10^{-6}$ . Provides interpretable per-weight relevance scores with clear theoretical motivation.

Weight × Gradient (W×G): Element-wise product of weight values and gradients from a standard autograd backward pass. The simplest possible backward signal requiring no custom implementation.

Per-element Fisher: Element-wise squared gradients, summed over calibration samples (the diagonal of the empirical Fisher matrix). Finer-grained than FWSVD’s row-wise Fisher: it retains the column structure.

Result: All three yield essentially identical WikiText-2 perplexity after ALS integration. Differences are within measurement noise.

Why this matters: This finding delivers two key messages:

The ALS integration mechanism is the innovation, not the backward signal. The element-wise weighting of the Frobenius objective and the rank-by-rank ALS sweep are what drive the improvement. Any element-wise backward signal — including the cheapest one (W×G) — suffices.
FWSVD’s failure is due to granularity loss, not signal quality. Fisher information (used in FWSVD) performs identically to LRP and W×G when computed at element resolution. The row-wise aggregation in FWSVD is the sole culprit for its catastrophic failure.

Practical recommendation: For deployment, use W×G as the influence signal. It requires one standard backward pass through the model, with no custom AttnLRP implementation and no special attention handling. The resulting quality is essentially identical to LRP (within measurement noise).
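To make the granularity argument concrete, the sketch below computes W×G, per-element Fisher, and a row-aggregated Fisher on a toy linear layer. The squared-error loss, the analytic gradient, and all names are assumptions for illustration; a real model would obtain the gradients from autograd.

```python
import numpy as np

def per_sample_grad(W, x, t):
    """Gradient of 0.5 * ||W x - t||^2 with respect to W (toy loss)."""
    return np.outer(W @ x - t, x)

def influence_signals(W, xs, ts):
    wxg = np.zeros_like(W)
    fisher = np.zeros_like(W)
    for x, t in zip(xs, ts):
        G = per_sample_grad(W, x, t)
        wxg += np.abs(W * G)      # Weight x Gradient, element-wise
        fisher += G * G           # per-element Fisher: squared gradients
    # Row-wise aggregation: one scalar per output row, shared by every column.
    row_fisher = np.broadcast_to(fisher.sum(axis=1, keepdims=True), W.shape)
    return wxg, fisher, row_fisher

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
xs = rng.standard_normal((5, 4))
ts = rng.standard_normal((5, 3))
wxg, fisher, row_fisher = influence_signals(W, xs, ts)
```

`wxg` and `fisher` vary from column to column within a row, while `row_fisher` is constant along each row. That lost column structure is the granularity the review identifies as the cause of FWSVD's failure.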

Limitations and Boundary Conditions

Layer-wise rank allocation is external. AIR takes the target rank $k$ as a given input per layer. The question of how to allocate a global parameter budget across layers — which layers are most sensitive and should retain more rank — is left to a separate pre-processing step. The paper uses uniform retention rates for comparability, but non-uniform allocation (e.g., based on layer sensitivity scores computed from the influence matrix) would likely improve quality at fixed overall budget.

Single ALS sweep. The paper uses exactly one sweep from $r = k-1$ to $r = 0$ . The monotone descent guarantee applies to any number of sweeps, and additional sweeps could improve quality further. No ablation over sweep count is provided. Given the modest 12-minute calibration time, and because extra sweeps reuse the covariance and influence statistics already collected, running 3–5 sweeps should add at most a few tens of minutes and might yield measurable additional gains at low retention rates.

Calibration distribution sensitivity. AIR’s influence matrix $I$ reflects the functional importance of weight elements under the calibration distribution. For heavily domain-specific models or models fine-tuned for narrow tasks, generic calibration data (e.g., WikiText-2) may misrepresent the model’s operational distribution. The 10× calibration efficiency advantage may shrink under distribution shift between calibration and inference.

Quantization interaction unstudied. Modern deployment pipelines frequently stack compression methods: quantization (INT8, FP8, or INT4) on top of structural compression (SVD). AIR’s paper does not explore whether influence-aware compression interacts favorably or unfavorably with subsequent quantization. The element-wise weighting by $I$ could amplify or attenuate quantization noise in important vs. unimportant elements in unexpected ways.

Non-standard architectures. AttnLRP is specifically designed for transformer attention operations. Applying AIR to non-transformer architectures — SSMs like Mamba, MoE layers with routing, or convolutional backbones — requires re-deriving the LRP propagation rules for those operations. The W×G fallback signal works universally via standard autograd, but its relative performance vs. LRP for non-transformer models is unknown.

Evaluation scope. All quantitative results are on LLaMA family models with WikiText-2 and C4 evaluation. Instruction-tuned models (e.g., LLaMA-2-Chat, Mistral-7B-Instruct), code models (e.g., CodeLlama), and models from non-LLaMA families (Qwen, Gemma, Phi) are not evaluated. Whether the gains generalize to RLHF-aligned models — where the loss landscape and weight importance structure may differ substantially from the base model — is an open question.

Critical Assessment: Weaknesses and Improvements

What the Paper Does Well

Diagnostic clarity. The paper’s greatest contribution is not the algorithm itself but the diagnosis of why FWSVD fails and the minimal fix that resolves it. The progression from “backward signals help → FWSVD uses them → FWSVD is catastrophically worse → element-wise resolution is the missing link” is clean and convincing. The ablation that confirms signal-agnosticism seals the argument.

Formal guarantee. Proposition 3.1 is meaningful. It is not a mere existence proof: it says every single sub-step of the ALS sweep is non-increasing in $\mathcal{L}_{act,infl}$ . This is a practically useful guarantee: AIR can be adopted as a drop-in replacement for SVD-LLM(W) with no risk of regression in the proxy objective, although downstream perplexity is not covered by the guarantee.

Calibration efficiency. The 10× fewer samples result is a practically important finding that the paper underplays. For many deployment scenarios, calibration data availability is the binding constraint, not compute. Highlighting this advantage more prominently would make the paper more actionable.

LoRA composability. The framework choice to keep AIR layer-local and free of global optimization constraints pays off: AIR+LoRA cleanly outperforms ACIP without requiring any architectural changes to the fine-tuning pipeline.

Weaknesses and Omissions

W1 — Evaluation breadth is insufficient for ICLR/NeurIPS standards (workshop level is acceptable). All main results are on LLaMA-7B. The cross-family section mentions generalization but without complete quantitative tables. Reviewers at top venues would require LLaMA-13B/70B, Mistral-7B, and at least one instruction-tuned model. The reasoning benchmark aggregate (ARC + WinoGrande + HellaSwag) omits harder tasks like MMLU, GSM8K, and HumanEval that better discriminate compression quality.

W2 — δ hyperparameter selection protocol is absent. The paper reports a single δ value but provides no guidance on how to choose it for a new model. Is it consistent across model families? Does it interact with the retention rate? Without a principled or automated selection method, practitioners must perform expensive per-model grid search. A simple validation perplexity proxy on a 16-sample held-out split would suffice and adds negligible cost.

W3 — Number of ALS sweeps is not ablated. The “single sweep” choice is stated but not justified by ablation. Given the monotone descent guarantee, additional sweeps are free in terms of quality risk. An ablation showing quality vs. sweep count (1, 3, 5, 10) would establish whether one sweep is near-optimal or if there are meaningful gains from more sweeps — especially at aggressive compression rates (20–40 pct).

W4 — ACIP comparison is incomplete. AIR alone does not consistently outperform ACIP; the comparison requires adding LoRA. The paper should be more explicit about when AIR alone is the right choice (no fine-tuning budget, resource-constrained edge deployment) vs. when AIR+LoRA (with fine-tuning budget) is appropriate. A fair comparison of ACIP+LoRA vs. AIR+LoRA would also be informative.

W5 — Compression overhead during calibration is unreported. The 12-minute calibration time is given, but peak VRAM during compression (storing full influence matrices $I \in \mathbb{R}^{m \times n}$ and residuals $E_r \in \mathbb{R}^{m \times n}$ per layer) is not reported. For LLaMA-7B, a 4096×4096 matrix in FP32 is 64 MB; with all layers active simultaneously, compression-time VRAM could be substantial. Practitioners need this to know whether compression is feasible on the same GPU as inference.

W6 — The (1 + δI) anchor form is presented as given, not derived. Alternative formulations — $e^{\delta I}$ , $\text{softplus}(\delta I)$ , $\max(\alpha, \delta I)$ with tunable floor $\alpha$ — are not explored. For extreme compression ratios (20 pct retention), a different anchor form might provide better calibration between activation-aware and influence-aware penalties.
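The memory concern in W5 can be sized with simple arithmetic. The 64 MB figure is from the text; the all-layers total below assumes the standard LLaMA-7B layout (32 blocks, hidden size 4096, MLP size 11008) and assumes every projection's influence matrix is held in FP32 at once, which the paper does not state.

```python
BYTES_FP32 = 4
MIB = 2 ** 20
GIB = 2 ** 30

# One 4096 x 4096 influence matrix in FP32.
square_mib = 4096 * 4096 * BYTES_FP32 / MIB

# Assumed per-block layout: 4 attention projections of 4096 x 4096
# and 3 MLP projections of 11008 x 4096.
per_block_elems = 4 * 4096 * 4096 + 3 * 11008 * 4096
all_layers_gib = 32 * per_block_elems * BYTES_FP32 / GIB

print(square_mib, all_layers_gib)   # 64.0 24.125
```

Under those assumptions, holding every influence matrix simultaneously would take about 24 GiB before counting residuals or the model itself, which is why per-layer processing matters.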

Concrete Improvement Suggestions

A. Automated δ selection via validation PPL proxy. Hold out 10–15 pct of calibration samples. After the SVD-LLM(W) initialization, evaluate three δ values (e.g., 0.1, 1.0, 10.0) via a single ALS sweep each and pick the one minimizing validation PPL. The forward and backward calibration passes are shared across δ values, so only the ALS sweep and the validation evaluation are repeated (three times instead of once), which removes the hyperparameter tuning burden at modest cost.

B. Multi-sweep ALS with convergence threshold. Replace the fixed single sweep with a loop: repeat sweeps until the relative decrease in $\mathcal{L}_{act,infl}$ falls below a threshold $\epsilon_{rel}$ . The loop is safe by Proposition 3.1. A practical implementation could cap at 5 sweeps. An ablation over sweep count from 1 to 10 at 20–40 pct retention (where gains are largest) would characterize the quality-cost tradeoff.

C. Sensitivity-guided layer-wise rank allocation. Use the per-layer influence matrix magnitude (e.g., $\sum_{ij} I_{ij}^2$ or the nuclear norm of $I$ ) as a proxy for layer sensitivity. Allocate more rank to high-sensitivity layers within a fixed global parameter budget. This would decouple AIR from the need for external rank allocation methods and may significantly improve quality at fixed budget.

D. Broader evaluation. Extend the main tables beyond LLaMA-7B to LLaMA-13B, plus Mistral-7B and its instruction-tuned variant. Add GSM8K (math reasoning) and MMLU (knowledge breadth) as downstream benchmarks. Evaluate AIR on quantized baselines (FP8 weights + AIR compression) to assess stacking behavior.

E. Memory-efficient ALS implementation. To reduce compression-time VRAM, the ALS sweep can be implemented layer-by-layer with the influence matrix computed and discarded per-layer. Reporting VRAM usage during compression alongside calibration time would make the paper more practically useful.
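As one possible reading of suggestion C, ranks could be made proportional to a per-layer sensitivity score under a global parameter budget. The proportional rule, the function name, and the toy scores are all assumptions of this sketch and are not taken from the paper.

```python
def allocate_ranks(shapes, sensitivities, retention):
    """Ranks proportional to sensitivity, spending about retention * total params.

    shapes: list of (m, n) per layer; sensitivities: positive scores per layer.
    """
    budget = retention * sum(m * n for m, n in shapes)
    # A layer of rank k costs k * (m + n) parameters; choose k_l = c * s_l.
    cost_per_unit = sum(s * (m + n) for s, (m, n) in zip(sensitivities, shapes))
    c = budget / cost_per_unit
    ranks = []
    for s, (m, n) in zip(sensitivities, shapes):
        k = int(c * s)                            # floor keeps the total within budget
        ranks.append(max(1, min(k, min(m, n))))   # clip to a valid rank
    return ranks

shapes = [(64, 64)] * 3
ranks = allocate_ranks(shapes, sensitivities=[1.0, 2.0, 3.0], retention=0.5)
used = sum(k * (m + n) for k, (m, n) in zip(ranks, shapes))
print(ranks, used)   # [8, 16, 24] 6144
```

Clipping to at least rank 1 can overshoot the budget in extreme cases, and clipping to `min(m, n)` can leave budget unspent; a production version would redistribute the remainder. Since $I$ is normalized to unit mean per layer, a cross-layer score would likely need the unnormalized totals, which this sketch leaves to the caller.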

Conclusion

AIR is a technically elegant, practically impactful contribution to SVD-based LLM compression. Its core insight — that element-wise integration of a backward-signal influence matrix atop activation-whitened SVD yields large quality gains at low additional cost — is cleanly motivated, formally guaranteed, and empirically validated.

The diagnostic framing (why FWSVD fails, what AIR fixes) is the paper’s strongest conceptual contribution. The finding that all element-wise backward signals perform equivalently is a practically liberating result: any transformer with autograd support can use W×G as the influence signal without specialized AttnLRP implementation.

Perplexity improvements are substantial at the most practically relevant compression ratios: 18 pct at 60 pct retention (the sweet spot for memory vs. quality tradeoffs), growing to 45 pct at 20 pct retention (extreme compression). System-level numbers (64 pct memory reduction, 53 pct latency reduction on A100) are deployment-relevant. The 10× calibration data efficiency is underemphasized but practically important for production deployments.

Weaknesses are mostly evaluative scope limitations consistent with a workshop paper: narrow model family coverage, missing ablations over sweep count and δ, and incomplete comparison with ACIP under LoRA fine-tuning. The mathematical framework and engineering design choices are sound and well-argued.

For practitioners compressing open-weight LLMs for inference deployment, AIR represents a clear upgrade over SVD-LLM(W) with a favorable cost/benefit ratio: 12 extra minutes of calibration for 18–45 pct perplexity improvement at the retention rates that matter most. The formal non-degradation guarantee makes it safe to adopt as a default in any compression pipeline that already uses SVD-LLM(W).

Future directions include layer-wise rank allocation guided by the influence matrix, multi-sweep ALS with convergence criteria, and combination with structured sparsity or quantization for further efficiency gains.

References

Eckart-Young theorem: Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3), 211–218.
ASVD: Yuan, Z., et al. (2023). ASVD: Activation-aware singular value decomposition for compressing large language models. arXiv:2312.05821.
SVD-LLM: Wang, X., et al. (2024). SVD-LLM: Truncation-aware singular value decomposition for large language model compression. arXiv:2403.07378.
FWSVD: Hsu, Y.-C., et al. (2022). Language model compression with weighted low-rank factorization. In ICLR 2022.
ACIP: Genzel, M., et al. (2025). Choose your model size: Any compression by a single gradient descent. arXiv preprint.
LoRA: Hu, E. J., et al. (2022). LoRA: Low-rank adaptation of large language models. In ICLR 2022.
AttnLRP: Achtibat, R., et al. (2024). AttnLRP: Attention-aware layer-wise relevance propagation for transformers. In Proceedings of ICML 2024.
LRP original: Bach, S., et al. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7).
LLaMA: Touvron, H., et al. (2023). LLaMA: Open and efficient foundation language models. arXiv:2302.13971.
Fisher pruning (OBD/OBS): LeCun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal brain damage. In NeurIPS 1990. Hassibi, B., & Stork, D. G. (1993). Second order derivatives for network pruning: Optimal Brain Surgeon. In NeurIPS 1992.
AIR (this paper): Harder, N., et al. (2026). Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs. ICML 2026 Workshop on Resource-Adaptive Foundation Model Inference (AdaptFM), Seoul, South Korea. arXiv:2606.19993.