SliceGPT: Post-Training LLM Compression via Computational Invariance

Review date: 2026-06-12
Review author: Zhongzhu Zhou
Paper reviewed: SliceGPT: Compress Large Language Models by Deleting Rows and Columns
Paper authors: Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman
arXiv: 2401.15024
Status / Venue: ICLR 2024 (accepted); Microsoft Research + ETH Zürich; 22 pages, 8 figures

Short Answer

SliceGPT proposes a post-training compression scheme built on a structural mathematical insight called computational invariance: any orthogonal change-of-basis applied simultaneously to consecutive weight matrices cancels out exactly, leaving the model’s outputs unchanged. The authors use PCA over calibration activations to find the basis in which the residual stream’s last few directions carry near-zero variance, then physically remove those rows and columns from the weight matrices. The result is a set of smaller, fully dense weight matrices that run faster on standard hardware with no custom CUDA kernels. At 25% parameter reduction, LLAMA2-70B retains 99% of its zero-shot performance while inference compute drops to 64–66% of the original.

Prerequisites

1. Transformer Architecture Fundamentals

A modern decoder-only transformer (GPT, LLAMA, OPT) is a stack of $L$ transformer blocks, each containing:

  1. RMS Layer Normalization: normalizes the residual stream by its RMS and scales by a learned vector $\gamma \in \mathbb{R}^d$
  2. Multi-Head Self-Attention: applies Q/K/V projections, scaled dot-product attention, and an output projection
  3. MLP / Feed-Forward Network: an up-projection, a pointwise nonlinearity (GeLU, SiLU), and a down-projection
  4. Residual connections: the output of every sub-block is added back to the input

The central data structure flowing through the network is the residual stream: a tensor of shape $(\text{seq\_len}, d)$ where $d$ is the model dimension (also called hidden size or embedding dimension). In LLAMA2-7B, $d = 4096$; in LLAMA2-70B, $d = 8192$.

Every linear layer that touches the residual stream either reads a vector from it and multiplies by a weight matrix to produce an intermediate tensor (Q/K/V and the MLP up-projections), or writes its result back to the stream (the attention output and MLP down-projections). The dimension $d$ is the bottleneck that SliceGPT targets.

2. Singular Value Decomposition (SVD)

For any matrix $A \in \mathbb{R}^{m \times n}$, the SVD factorizes it as:

$$A = U \Sigma V^T$$

where:

  • $U \in \mathbb{R}^{m \times m}$: orthonormal left singular vectors (columns form an orthonormal basis of $\mathbb{R}^m$)
  • $\Sigma \in \mathbb{R}^{m \times n}$: diagonal matrix of singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$
  • $V \in \mathbb{R}^{n \times n}$: orthonormal right singular vectors

The Eckart–Young theorem gives the best rank-$k$ approximation:

$$A_k = U_k \Sigma_k V_k^T, \quad \text{with} \quad \|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$$

where $r$ is the rank of $A$.

SliceGPT does not apply SVD directly to weight matrices (that would be ordinary low-rank compression). Instead it uses the same decomposition, in the form of PCA, to find the optimal change of basis for the activations, which is a conceptually different use of the same tool.
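The Eckart–Young identity can be checked numerically. The following is a minimal sketch on a random matrix (the sizes and the seed are arbitrary choices for illustration, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))

# Thin SVD: U is 8x5, s holds the 5 singular values in descending order, Vt is 5x5.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]        # rank-k truncation U_k Sigma_k V_k^T

frob_error = np.linalg.norm(A - A_k, ord="fro")
predicted = np.sqrt(np.sum(s[k:] ** 2))     # sqrt of the discarded squared singular values

print(frob_error, predicted)                # the two numbers agree up to rounding
```

The truncation error depends only on the discarded singular values, which is the same accounting SliceGPT later applies to discarded PCA eigenvalues.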

3. Principal Component Analysis (PCA) and Its Geometry

Given a data matrix $X \in \mathbb{R}^{d \times n}$ whose columns are activation samples, PCA finds the orthogonal transformation $Q \in \mathbb{R}^{d \times d}$ such that the covariance of $QX$ is diagonal:

$$\text{Cov}(QX) = Q \cdot \frac{XX^T}{n} \cdot Q^T = \text{diag}(\lambda_1, \ldots, \lambda_d)$$

with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$. The rows of $Q$ are the eigenvectors of the empirical covariance $\frac{1}{n} XX^T$, sorted by descending eigenvalue. The eigenvalue $\lambda_i$ measures the variance of the activations in the $i$-th principal direction.

After transforming $X \mapsto QX$, suppose the trailing eigenvalues are negligible: $\lambda_{k+1}, \ldots, \lambda_d \approx 0$. Because $\frac{1}{n} XX^T$ is an uncentered second moment, coordinates $k+1, \ldots, d$ of $QX$ are then close to zero in every sample and carry almost no information. Discarding them is essentially lossless.
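As an illustration, the sketch below builds synthetic activations that lie almost entirely in a 4-dimensional subspace of a 6-dimensional space, computes the PCA rotation, and confirms that the rotated covariance is diagonal with near-zero trailing entries. All sizes and the noise level are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 6, 2000, 4                         # r = number of directions that carry signal

# Synthetic activations: columns are samples living (almost) in an r-dim subspace.
basis = np.linalg.qr(rng.standard_normal((d, d)))[0]
coeffs = rng.standard_normal((r, n)) * np.array([[5.0], [3.0], [2.0], [1.0]])
X = basis[:, :r] @ coeffs + 1e-4 * rng.standard_normal((d, n))

S = X @ X.T / n                              # empirical (uncentered) second moment
eigvals, eigvecs = np.linalg.eigh(S)         # ascending order, eigenvectors in columns
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]                         # descending eigenvalues
Q = eigvecs[:, order].T                      # rows of Q are the principal directions

Z = Q @ X
cov_Z = Z @ Z.T / n                          # equals diag(lam) up to rounding
print(np.round(lam, 6))
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order, so the reordering step is needed to match the descending convention used in the text.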

4. Orthogonal Matrices: The Key Algebraic Tool

A matrix $Q \in \mathbb{R}^{d \times d}$ is orthogonal if $QQ^T = Q^TQ = I$. Its critical properties:

  • Norm-preserving: $\|Qx\|_2 = \|x\|_2$ for all $x$ (orthogonal transforms are rigid rotations/reflections)
  • Exact inverse: $Q^{-1} = Q^T$ (cheap to invert)
  • Exact identity insertions: $Q^TQ = I$, so inserting $Q^TQ$ anywhere in a product leaves it unchanged

The last property is the crux of SliceGPT. Inserting $I = Q^TQ$ between two weight matrices changes the parameterization but not the computation, and choosing $Q$ wisely (via PCA) reveals low-variance directions that can be discarded.

5. Post-Training Compression: The Landscape

Post-training compression reduces model size or compute after training, using only forward passes on a small calibration dataset. Three main paradigms:

| Method | Strategy | Acceleration Mechanism | Custom Kernel? |
|---|---|---|---|
| Quantization (GPTQ, AWQ) | Reduce precision (FP16→INT4) | Less memory bandwidth | Partial (dequant.) |
| Unstructured Sparsity (SparseGPT, Wanda) | Zero individual weights | Sparse GEMM | Yes |
| Structured Compression (SliceGPT, LLM-Pruner) | Remove entire dimensions | Smaller dense GEMM | No |

SliceGPT is a structured method: it removes complete rows and columns, leaving matrices that are still dense but smaller. This means standard highly-optimized dense BLAS libraries (cuBLAS, oneDNN) work without modification.

6. Computational Complexity Preview

For a transformer layer with residual-stream dimension $d$ and MLP intermediate dimension $d_\text{ff}$, per-token compute in the weight matrices is approximately:

$$\text{FLOPs} \approx 2 \times (3d^2 + d^2 + 2d \cdot d_\text{ff}) = 2(4d^2 + 2d \cdot d_\text{ff})$$

Here each $d^2$ term is really $d \cdot (H \cdot d_\text{head})$ with $H \cdot d_\text{head} = d$, and a SwiGLU MLP has three $d \times d_\text{ff}$ matrices rather than two. If SliceGPT reduces $d \to k = (1-s)d$ with $s = 0.25$, then $k = 0.75d$. The head dimension and $d_\text{ff}$ are not sliced, so every term loses only one factor of $d$ and scales as $k/d = 0.75$, not $(k/d)^2 = 0.5625$. The paper's 64–66% figure for LLAMA2-70B is a measured total-compute number (per-token latency multiplied by GPU count), so it reflects the smaller GPU footprint as well as the smaller matrices.
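The scaling can be made concrete with a small calculator. This is a sketch under the dimension bookkeeping used in this review: only the residual-stream width changes, while the attention-internal width $H \cdot d_\text{head}$ and $d_\text{ff}$ stay fixed. The function name and the LLAMA2-7B-like sizes are illustrative assumptions, and attention-score FLOPs are ignored.

```python
def matmul_flops_per_token(d_stream: int, d_attn: int, d_ff: int, n_mlp_mats: int = 2) -> int:
    """Multiply-add FLOPs of one block's weight matmuls for a single token.

    d_stream   : residual-stream width (d before slicing, k after)
    d_attn     : H * d_head, the attention-internal width (not sliced)
    d_ff       : MLP intermediate width (not sliced)
    n_mlp_mats : 2 for a plain MLP, 3 for SwiGLU
    """
    attention = 4 * d_stream * d_attn        # W_Q, W_K, W_V, W_O
    mlp = n_mlp_mats * d_stream * d_ff
    return 2 * (attention + mlp)

d, d_ff = 4096, 11008                        # LLAMA2-7B-like sizes, for illustration only
k = int(0.75 * d)

dense = matmul_flops_per_token(d, d, d_ff)
sliced = matmul_flops_per_token(k, d, d_ff)
ratio = sliced / dense
print(ratio)
```

Every term is linear in the residual-stream width, so the ratio equals $k/d$ regardless of how large $d_\text{ff}$ is relative to $d$.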

What SliceGPT Does: Overview

SliceGPT (Ashkboos et al., Microsoft Research + ETH Zürich, ICLR 2024) makes three contributions:

Contribution 1: Computational invariance theorem. A formal proof that for any sequence of orthogonal matrices $\{Q_0, Q_1, \ldots, Q_L\}$, there exists a reparameterization of every transformer weight matrix such that the model’s output is exactly preserved for all inputs.

Contribution 2: A principled slicing algorithm. Using PCA on calibration-data activations, the algorithm (a) identifies the optimal orthogonal basis at each layer, (b) rotates the weights into this basis, and (c) physically truncates the weight matrices by removing the last $d - k$ rows/columns (the directions with near-zero activation variance).

Contribution 3: Hardware-native deployment. The sliced model consists only of smaller dense matrices, running on standard hardware without any new infrastructure, achieving actual latency and GPU-count reductions.

The Core Insight: Computational Invariance

Formal Derivation

Setup. Consider two consecutive linear operations separated by an element-wise nonlinearity $\phi$ (GeLU, SiLU, ReLU):

$$y = W_2 \,\phi\!\bigl(W_1\, x\bigr)$$

with $W_1 \in \mathbb{R}^{h \times d}$, $W_2 \in \mathbb{R}^{d \times h}$, $x \in \mathbb{R}^d$.

Step 1: Insert $Q^TQ = I$.

For any orthogonal $Q \in \mathbb{R}^{d \times d}$:

$$y = W_2\, \phi\!\bigl(W_1\, Q^T Q\, x\bigr)$$

Step 2: Re-parenthesize.

$$y = W_2\, \phi\!\bigl((W_1 Q^T)(Q x)\bigr)$$

Define $\tilde{W}_1 = W_1 Q^T$ and $\tilde{x} = Qx$. Then:

$$y = W_2\, \phi(\tilde{W}_1\, \tilde{x})$$

The output $y$ is mathematically identical (in floating point it agrees up to rounding). The computation is invariant under the orthogonal reparameterization $W_1 \to W_1 Q^T$, $x \to Qx$.
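A quick numerical check of Steps 1 and 2, as a sketch with random weights and a random orthogonal $Q$ obtained from a QR decomposition (all sizes are arbitrary, and SiLU stands in for $\phi$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 16, 64
W1 = rng.standard_normal((h, d))
W2 = rng.standard_normal((d, h))
x = rng.standard_normal(d)

def silu(z):
    return z / (1.0 + np.exp(-z))

# Random orthogonal matrix: the Q factor of a QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

y_original = W2 @ silu(W1 @ x)

W1_tilde = W1 @ Q.T          # the weight absorbs Q^T on its input side
x_tilde = Q @ x              # the input is expressed in the rotated basis
y_rotated = W2 @ silu(W1_tilde @ x_tilde)

print(np.max(np.abs(y_original - y_rotated)))   # rounding-level difference
```

The nonlinearity never sees the rotation, because $Q^T Q$ collapses to the identity before $\phi$ is applied.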

Step 3: Propagate through the full residual stream.

The residual stream at layer $l$ carries $x_l$. Let all operations reading from position $l$ absorb $Q_l^T$ on the right of their weight, and all operations writing to position $l$ absorb $Q_l$ on the left of their weight. Then:

  • The stream at position $l$ now carries $Q_l x_l$ in the new parameterization
  • Every consumer $W_\text{in}$ sees $(W_\text{in} Q_l^T)(Q_l x_l) = W_\text{in} x_l$, so its output is unchanged
  • Every producer $W_\text{out}$ now produces $Q_l (W_\text{out} x_{l-1})$, which is the new stream at position $l$

Theorem (Computational Invariance, ICLR 2024): For any pretrained transformer $f_\theta$ and any sequence of orthogonal matrices $\{Q_l\}_{l=0}^L$, there exists a reparameterized transformer $f_{\tilde{\theta}}$ with $f_{\tilde{\theta}}(x) = f_\theta(x)$ for all inputs $x$.

This is an exact statement: no error, no approximation. The subsequent slicing (keeping only $k$ dimensions) introduces the only approximation.

Truncation Error Bound

After choosing $Q_l$ to be the PCA matrix of calibration activations at layer $l$, the truncation error (squared norm of discarded activation components) is bounded by:

$$\epsilon_l \le C \sum_{i=k_l+1}^{d} \lambda_i^{(l)}$$

where $\lambda_i^{(l)}$ is the $i$-th eigenvalue of the empirical covariance at layer $l$ and $C$ is a constant. For well-trained large models, the eigenvalue spectrum decays sharply (Zipfian-like), making $\sum_{i > k} \lambda_i$ small even at modest $k$.
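For the activations themselves the relation holds with equality and $C = 1$: the mean squared norm of the discarded coordinates equals the sum of the discarded eigenvalues. The sketch below verifies this on synthetic data with a geometrically decaying spectrum (the data, sizes, and decay rate are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 12, 5000, 8

# Synthetic activations with a decaying spectrum (columns are samples).
scales = 2.0 ** -np.arange(d)
rotation = np.linalg.qr(rng.standard_normal((d, d)))[0]
X = rotation @ (scales[:, None] * rng.standard_normal((d, n)))

eigvals, eigvecs = np.linalg.eigh(X @ X.T / n)
lam = eigvals[::-1]                  # descending eigenvalues
Q = eigvecs[:, ::-1].T               # rows are principal directions

Z = Q @ X
# Mean over samples of the squared norm of the coordinates that slicing drops.
discarded_energy = np.mean(np.sum(Z[k:, :] ** 2, axis=0))
tail_sum = lam[k:].sum()
print(discarded_energy, tail_sum)
```

How this activation-level error propagates to the model output is what the constant $C$ has to absorb, and that part is not captured by this check.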

Figure 1: Computational Invariance Diagram

flowchart LR
    subgraph Original["Original Parameterization"]
        x1["x ∈ ℝᵈ"] --> W1["W₁ ∈ ℝ^{h×d}"]
        W1 --> phi1["φ(·)  element-wise"]
        phi1 --> W2["W₂ ∈ ℝ^{d×h}"]
        W2 --> y1["y ∈ ℝᵈ"]
    end
    subgraph Rotated["After inserting Q^T Q = I"]
        x2["Qx ∈ ℝᵈ"] --> W1Q["W₁Q^T ∈ ℝ^{h×d}"]
        W1Q --> phi2["φ(·)  element-wise"]
        phi2 --> W2b["W₂ ∈ ℝ^{d×h}"]
        W2b --> y2["y ∈ ℝᵈ (identical)"]
    end
    Original -. "Insert Q^T Q = I\n(zero error)" .-> Rotated

Figure 1: The computation is identical in both parameterizations. Choosing Q as the PCA rotation orders the coordinates by variance, making dimensions k+1 through d safe to discard.

The SliceGPT Algorithm

Algorithm 1: SliceGPT (Pseudocode)

Input:
  f_θ          pretrained transformer (L layers, hidden dim d)
  D_calib      calibration data: C sequences × T tokens each
                (paper uses C=256, T=2048 from C4 dataset)
  s            global sparsity ratio (paper uses s=0.25)

Output:
  f_θ̃          compressed transformer with hidden dim k = round(d·(1−s))

─────────────────────────────────────────────────────
Preprocessing (RMSNorm absorption):
  For each transformer block l:
    Fold scale parameter γ_l into the next weight:
      For W reading immediately after RMSNorm at l:
        W ← W · diag(γ_l)
    Replace RMSNorm by its scale-free form x / RMS(x).
  (This step is exact, and the scale-free normalization commutes with orthogonal Q.)
─────────────────────────────────────────────────────
Layer-wise PCA and slicing:
  For l = 0 to L−1:

    (A) Collect activations:
        Run D_calib through layers 0..l−1 with a forward hook.
        A_l ← concatenate all token hidden states at position l
              shape: (d, N) where N = C × T

    (B) Compute PCA basis:
        C_l ← (1/N) · A_l @ A_l.T          # empirical covariance (d×d)
        eigenvalues, V_l ← eigh(C_l)        # ASCENDING eigenvalues, eigenvectors in columns
        Q_l ← reverse_columns(V_l).T        # Q_l rows = eigenvectors sorted by DESCENDING eigenvalue

    (C) Choose slice width:
        k_l ← round(d · (1 − s))            # uniform sparsity
        # (non-uniform variant: optimize k_l via marginal EVR budget)

    (D) Transform and slice all weights at position l:
        For W_in ∈ {W_Q, W_K, W_V, W_gate, W_up}  # read from stream at l
          W_in ← (W_in @ Q_l.T)[:, :k_l]   # rotate then keep top-k cols

        For W_out ∈ {W_O, W_down}           # write to stream at l+1
          W_out ← (Q_{l+1} @ W_out)[:k_{l+1}, :]  # rotate then keep top-k rows
          (deferred: applied once Q_{l+1} and k_{l+1} are known from the NEXT iteration)

─────────────────────────────────────────────────────
Boundary transformations:
  Input embedding E ∈ ℝ^{V×d}:
    E ← (E @ Q_0.T)[:, :k_0]
  Output LM head W_lm ∈ ℝ^{V×d}:
    W_lm ← (W_lm @ Q_L.T)[:, :k_L]
─────────────────────────────────────────────────────
Optional recovery fine-tuning:
  Fine-tune f_θ̃ for 1 epoch on D_calib (or larger dataset)
  using standard AdamW with LoRA adapters.

Line-by-Line Explanation

Why absorb RMSNorm first?

RMSNorm computes $\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma$. The RMS scale is $\|x\|_2 / \sqrt{d}$, which is invariant to orthogonal transformation since $\|Qx\|_2 = \|x\|_2$. Therefore the RMS normalization itself is transparent to the basis change. The scale $\gamma$ acts as a diagonal matrix, which does not commute with $Q$, so it is absorbed into the next weight:

$$W \leftarrow W \cdot \text{diag}(\gamma)$$

After absorption, the only normalization left in the graph is the scale-free division by $\text{RMS}(x)$, which commutes with any orthogonal $Q$. This simplification is not an approximation: it is algebraically exact.
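Both facts used here can be verified numerically: folding $\gamma$ into the next weight is exact, and the scale-free normalization $x / \text{RMS}(x)$ commutes with an orthogonal $Q$. A minimal sketch with random data (the helper name `rms_normalize` is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 16, 32
x = rng.standard_normal(d)
gamma = rng.uniform(0.5, 1.5, size=d)
W = rng.standard_normal((h, d))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def rms_normalize(v):
    """Scale-free RMS normalization: v / RMS(v)."""
    return v / np.sqrt(np.mean(v ** 2))

# (1) Folding gamma into the next weight is exact.
out_with_gamma = W @ (rms_normalize(x) * gamma)
W_folded = W * gamma                      # same as W @ diag(gamma): scales column j by gamma[j]
out_folded = W_folded @ rms_normalize(x)

# (2) The scale-free normalization commutes with an orthogonal Q.
commute_lhs = rms_normalize(Q @ x)
commute_rhs = Q @ rms_normalize(x)

print(np.allclose(out_with_gamma, out_folded), np.allclose(commute_lhs, commute_rhs))
```

The division by the RMS is nonlinear, so it cannot be folded into a weight matrix; it only needs to commute with $Q$, which it does.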

Why layer-by-layer, not all at once?

PCA at layer $l$ must reflect the actual distribution of activations produced by layers $0, \ldots, l-1$ on the calibration data. Using random Gaussian activations would give the wrong basis (the statistics of residual-stream activations are highly non-Gaussian). The layer-by-layer scan captures this correctly.

Why does this work for the residual stream?

Residual connections add the input to the output: $x_{l+1} = x_l + \text{Block}_l(x_l)$. Both $x_l$ and $\text{Block}_l(x_l)$ live in the same $\mathbb{R}^d$ space, so applying the same orthogonal matrix to both is consistent and the addition is preserved: $Q x_{l+1} = Q x_l + Q\, \text{Block}_l(x_l)$. When the block's output is written in a different basis $Q_{l+1}$, the skip path has to be brought into that basis as well, by multiplying it with $Q_{l+1} Q_l^T$.

The Q matrices are folded into the weights.

After transformation, $W_\text{in} \leftarrow (W_\text{in} Q_l^T)[:, :k]$ is a $d_\text{out} \times k$ matrix. It is stored as-is. At inference, the sliced model takes $k$-dimensional inputs and produces $k$-dimensional outputs. The rotations applied to the weights are baked into the weight values and never applied separately; the only change-of-basis matrices that remain are the sliced $Q_{l+1} Q_l^T$ factors on skip connections between blocks that use different bases.

Dimension bookkeeping.

After slicing, each transformer block operates with:

  • Input/output residual stream: $k = (1-s)d$ dimensions
  • Q/K/V matrices: $(H \cdot d_\text{head}) \times k$ (head dimension unchanged)
  • MLP: $d_\text{ff} \times k$ (gate, up) and $k \times d_\text{ff}$ (down), with the intermediate dimension unchanged

Each of these matrices loses only one of its two dimensions, so block parameters scale as $k/d = 0.75$ rather than $(k/d)^2 = 0.5625$.
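The shape bookkeeping can be sketched directly. The helpers `slice_reader` and `slice_writer` below are hypothetical names, the orthogonal matrices are random stand-ins for PCA bases, and the sizes are toy values. In this sketch each matrix keeps a fraction $k/d$ of its entries, because only the stream-facing dimension is cut.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_attn, d_ff, s = 32, 32, 88, 0.25
k = round(d * (1 - s))

# Stand-in orthogonal bases for the stream this block reads and the one it writes.
Q_l, _ = np.linalg.qr(rng.standard_normal((d, d)))
Q_next, _ = np.linalg.qr(rng.standard_normal((d, d)))

def slice_reader(W_in, Q, k):
    """Rotate a weight that reads from the stream, then keep the first k input columns."""
    return (W_in @ Q.T)[:, :k]

def slice_writer(W_out, Q, k):
    """Rotate a weight that writes to the stream, then keep the first k output rows."""
    return (Q @ W_out)[:k, :]

W_q = rng.standard_normal((d_attn, d))
W_up = rng.standard_normal((d_ff, d))
W_down = rng.standard_normal((d, d_ff))

W_q_s = slice_reader(W_q, Q_l, k)
W_up_s = slice_reader(W_up, Q_l, k)
W_down_s = slice_writer(W_down, Q_next, k)

dense_params = W_q.size + W_up.size + W_down.size
kept_fraction = (W_q_s.size + W_up_s.size + W_down_s.size) / dense_params
print(W_q_s.shape, W_up_s.shape, W_down_s.shape, kept_fraction)
```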

Figure 2: SliceGPT Compression Pipeline

flowchart TD
    A["Pretrained LLM\nhidden dim d"] --> B["Calibration dataset\n256 × 2048 tokens, C4"]
    B --> C["Step 1: Absorb RMSNorm γ\ninto adjacent weights"]
    C --> D["For each layer l:\nforward pass → A_l ∈ ℝ^{d×N}"]
    D --> E["PCA: covariance C_l = A_l A_l^T\neigh → Q_l, eigenvalues"]
    E --> F["Set k_l = round(d·(1−s))"]
    F --> G["W_in ← (W_in Q_l^T)[:, :k]\nW_out ← (Q_{l+1} W_out)[:k, :]"]
    G --> H{l < L?}
    H -->|Yes, l++| D
    H -->|Done| I["Transform embeddings E,\nLM head W_lm"]
    I --> J["Optional: 1-epoch fine-tuning\nwith LoRA"]
    J --> K["Compressed model\nhidden dim k = 0.75d\nDense matrices only"]

Figure 2: The full SliceGPT pipeline. The calibration phase (collecting activations, computing PCA) requires only forward passes — no gradients. The Q matrices are absorbed and not stored.

Handling Special Components

RMSNorm / LayerNorm

As derived above, the RMSNorm scale is absorbed exactly into the first downstream weight. For LayerNorm (used in OPT), which also has a bias $\beta$:

$$\text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta$$

The bias $\beta$ is absorbed into the bias term of the following linear layer:

$$b_\text{new} = W_\text{in} \beta + b_\text{old}$$

The mean subtraction is a linear map and can be folded into the weights that write to the stream. After absorption, what remains of both LayerNorm and RMSNorm is the parameter-free division by the RMS, which commutes with orthogonal $Q$.
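The affine part of the absorption is a one-line identity, $W(\gamma \odot z + \beta) + b = (W \operatorname{diag}(\gamma))\, z + (W\beta + b)$, where $z$ is the standardized input. A minimal numerical check with random parameters (the `standardize` helper is an illustrative name):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 16, 32
x = rng.standard_normal(d)
gamma = rng.uniform(0.5, 1.5, size=d)
beta = rng.standard_normal(d)
W_in = rng.standard_normal((h, d))
b_old = rng.standard_normal(h)

def standardize(v):
    """Parameter-free part of LayerNorm: subtract the mean, divide by the std."""
    return (v - v.mean()) / np.sqrt(v.var())

# Original: LayerNorm with affine parameters, followed by a linear layer.
y_original = W_in @ (standardize(x) * gamma + beta) + b_old

# Folded: gamma goes into the weight, beta into the bias.
W_new = W_in * gamma                  # W_in @ diag(gamma)
b_new = W_in @ beta + b_old           # uses the ORIGINAL W_in, before gamma is folded in
y_folded = W_new @ standardize(x) + b_new

print(np.max(np.abs(y_original - y_folded)))
```

The order matters: $b_\text{new}$ must be computed from the weight before $\gamma$ is folded into it.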

Multi-Head Self-Attention

For $H$ heads with per-head dimension $d_\text{head}$ (and $H \cdot d_\text{head} = d$):

Q/K/V projections all read from the same stream at layer $l$:

$$\tilde{W}_Q = (W_Q Q_l^T)[\text{:, :k}], \quad \tilde{W}_K = (W_K Q_l^T)[\text{:, :k}], \quad \tilde{W}_V = (W_V Q_l^T)[\text{:, :k}]$$

Each becomes a matrix of shape $(H \cdot d_\text{head}) \times k$ (from $(H \cdot d_\text{head}) \times d$). The input dimension shrinks from $d$ to $k$; the per-head output dimension $d_\text{head}$ is unchanged.

Output projection $W_O \in \mathbb{R}^{d \times (H \cdot d_\text{head})}$ writes to the stream at layer $l+1$:

$$\tilde{W}_O = (Q_{l+1} W_O)[\text{:k, :}]$$

The output dimension shrinks from $d$ to $k_{l+1}$; the input dimension $H \cdot d_\text{head}$ is unchanged.

Grouped-Query Attention (LLAMA2-70B uses GQA): The key and value heads are shared across multiple query groups. SliceGPT handles this identically: the input dimension to $W_K, W_V$ shrinks from $d$ to $k$, while the per-head dimension stays fixed.

MLP Block (SwiGLU)

LLAMA2’s MLP uses SwiGLU:

$$\text{MLP}(x) = W_\text{down}\!\bigl(\text{SiLU}(W_\text{gate}\, x) \odot W_\text{up}\, x\bigr)$$

Both $W_\text{gate}$ and $W_\text{up}$ read from the layer-$l$ stream:

$$\tilde{W}_\text{gate} = (W_\text{gate}\, Q_l^T)[\text{:, :k}], \qquad \tilde{W}_\text{up} = (W_\text{up}\, Q_l^T)[\text{:, :k}]$$

$W_\text{down}$ writes to the layer-$(l+1)$ stream:

$$\tilde{W}_\text{down} = (Q_{l+1}\, W_\text{down})[\text{:k, :}]$$

The intermediate dimension $d_\text{ff}$ ($\approx 8d/3$ for SwiGLU in LLAMA2) is not sliced in the basic algorithm. Slicing $d_\text{ff}$ would require an additional PCA pass over post-nonlinearity activations and is left for future work.
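The sketch below runs a toy SwiGLU block three ways: original, rotated, and rotated with read-side slicing. To make the slicing step exactly lossless, the input is constructed to lie in a $k$-dimensional subspace and the first $k$ rows of the basis span that subspace; this is an idealized stand-in for a PCA basis with zero trailing eigenvalues, not a property of real activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff, k = 16, 40, 12

W_gate = rng.standard_normal((d_ff, d))
W_up = rng.standard_normal((d_ff, d))
W_down = rng.standard_normal((d, d_ff))

def silu(z):
    return z / (1.0 + np.exp(-z))

def swiglu(x, W_gate, W_up, W_down):
    return W_down @ (silu(W_gate @ x) * (W_up @ x))

# Orthogonal basis whose first k rows span the subspace that contains x.
Q_l = np.linalg.qr(rng.standard_normal((d, d)))[0].T
x = Q_l[:k, :].T @ rng.standard_normal(k)
Q_next, _ = np.linalg.qr(rng.standard_normal((d, d)))

y = swiglu(x, W_gate, W_up, W_down)

# Rotation only: the block now outputs Q_next @ y exactly.
y_rot = swiglu(Q_l @ x, W_gate @ Q_l.T, W_up @ Q_l.T, Q_next @ W_down)

# Read-side slicing: exact here because the dropped coordinates of Q_l @ x are zero.
x_sliced = (Q_l @ x)[:k]
y_sliced_in = swiglu(x_sliced, (W_gate @ Q_l.T)[:, :k], (W_up @ Q_l.T)[:, :k], Q_next @ W_down)

print(np.allclose(y_rot, Q_next @ y), np.allclose(y_sliced_in, Q_next @ y))
```

With real activations the dropped coordinates are small rather than zero, so the sliced output is an approximation whose quality depends on the eigenvalue tail.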

Embedding and LM Head

The token embedding table $E \in \mathbb{R}^{V \times d}$ maps discrete token IDs to the residual stream at position 0. It must be aligned with the layer-0 basis $Q_0$:

$$\tilde{E} = (E Q_0^T)[\text{:, :}k_0]$$

The output LM head $W_\text{lm} \in \mathbb{R}^{V \times d}$ reads from the final residual stream (position $L$):

$$\tilde{W}_\text{lm} = (W_\text{lm}\, Q_L^T)[\text{:, :}k_L]$$

After these transformations, the model is fully self-consistent. A $k$-dimensional residual stream flows from the embedding table through all $L$ layers to the LM head with no mismatch.

Figure 3: Component-Level Slicing Map

flowchart LR
    RS_l["Residual stream l\ndim: k_l = 0.75d"] --> WQ["W_Q Q_l^T [:, :k]\ndim: d_h × k_l"]
    RS_l --> WK["W_K Q_l^T [:, :k]\ndim: d_h × k_l"]
    RS_l --> WV["W_V Q_l^T [:, :k]\ndim: d_h × k_l"]
    RS_l --> Wg["W_gate Q_l^T [:, :k]\ndim: d_ff × k_l"]
    RS_l --> Wu["W_up Q_l^T [:, :k]\ndim: d_ff × k_l"]
    WQ & WK & WV --> Attn["Attention\n(internal d_h unchanged)"]
    Attn --> WO["Q_{l+1} W_O [:k, :]\ndim: k_{l+1} × d_h"]
    Wg & Wu --> MLP["SiLU / GeLU\n(d_ff unchanged)"]
    MLP --> Wd["Q_{l+1} W_down [:k, :]\ndim: k_{l+1} × d_ff"]
    WO & Wd --> RS_l1["Residual stream l+1\ndim: k_{l+1} = 0.75d"]

Figure 3: Every weight that reads from the residual stream has its input dimension sliced from d to k. Every weight that writes to the stream has its output dimension sliced. Internal dimensions (d_h, d_ff) are unchanged.

Calibration: Practical Details

Dataset and Scale

SliceGPT uses 256 sequences of 2048 tokens from C4 (≈524K tokens total). The authors confirm that:

  • C4 and Wikitext-2 give essentially identical results (PCA basis is data-distribution-insensitive within natural text)
  • 128 sequences is sufficient; 512 provides marginal improvement
  • The calibration needs only inference-mode forward passes — no gradients, no optimizer state

For LLAMA2-70B at 8192 dimensions, each covariance matrix is $8192 \times 8192 \approx 67\text{M}$ entries (256 MB in FP32). With 80 layers, the total covariance storage is ~20GB, which is manageable on a single A100.
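The covariance never requires storing all activations at once: it can be accumulated batch by batch. The following is a sketch of that idea with toy sizes; the function name is hypothetical and accumulating in float64 is a choice of this example. The last lines reproduce the storage arithmetic quoted above.

```python
import numpy as np

def accumulate_second_moment(batches, d):
    """Average of a a^T over all activation columns, accumulated batch by batch.

    Each batch has shape (d, n_batch); accumulation is done in float64.
    """
    total = np.zeros((d, d), dtype=np.float64)
    count = 0
    for A in batches:
        A = A.astype(np.float64)
        total += A @ A.T
        count += A.shape[1]
    return total / count

rng = np.random.default_rng(0)
d = 8
batches = [rng.standard_normal((d, n)).astype(np.float32) for n in (100, 250, 50)]

C_streamed = accumulate_second_moment(batches, d)
A_all = np.concatenate(batches, axis=1).astype(np.float64)
C_direct = A_all @ A_all.T / A_all.shape[1]

# Storage arithmetic for d = 8192 in FP32 (4 bytes per entry), 80 layers.
bytes_per_matrix = 8192 * 8192 * 4
total_gib = 80 * bytes_per_matrix / 2**30
print(bytes_per_matrix / 2**20, total_gib)    # MiB per matrix, GiB for all layers
```

Because slicing proceeds layer by layer, only one such matrix needs to be resident at a time if memory is tight.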

Explained Variance Ratio

After running PCA at layer $l$, the explained variance ratio (EVR) at width $k$ is:

$$\text{EVR}(k, l) = \frac{\sum_{i=1}^{k} \lambda_i^{(l)}}{\sum_{i=1}^{d} \lambda_i^{(l)}}$$

For LLAMA2-70B at $k = 0.75d$, EVR typically exceeds 99.5% at early/middle layers and drops slightly (to ~98.5%) at the final layers. This quantifies how much activation energy is preserved after slicing.
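EVR is straightforward to compute from a layer's eigenvalues, and the same cumulative sum gives the smallest width that reaches a target ratio. A minimal sketch on a toy spectrum (the function names and the spectrum are illustrative, not taken from the paper):

```python
import numpy as np

def explained_variance_ratio(eigenvalues, k):
    """Fraction of total activation energy kept by the top-k principal directions."""
    lam = np.sort(np.asarray(eigenvalues, dtype=np.float64))[::-1]
    return lam[:k].sum() / lam.sum()

def smallest_width_for(eigenvalues, target):
    """Smallest k whose EVR reaches the target (for a target between 0 and 1)."""
    lam = np.sort(np.asarray(eigenvalues, dtype=np.float64))[::-1]
    cumulative = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(cumulative, target) + 1)

lam = np.array([8.0, 4.0, 2.0, 1.0, 0.5, 0.25, 0.125, 0.125])   # toy spectrum, sums to 16
evr_6 = explained_variance_ratio(lam, 6)      # (8 + 4 + 2 + 1 + 0.5 + 0.25) / 16
k_95 = smallest_width_for(lam, 0.95)
print(evr_6, k_95)
```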

Non-Uniform Sparsity Allocation

A more principled variant optimizes per-layer $k_l$ subject to a global parameter budget:

$$\min_{k_0, \ldots, k_{L-1}} \sum_{l=0}^{L-1} \underbrace{\sum_{i=k_l+1}^{d} \lambda_i^{(l)}}_{\text{truncation error}} \quad \text{s.t.} \quad \sum_{l=0}^{L-1} \text{params}(k_l) \le B$$

This can be solved greedily: sort layers by marginal truncation error per parameter removed and allocate budget accordingly. The paper reports that non-uniform allocation gives marginal improvement over uniform sparsity for large models; small models benefit more.
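One way to realize the greedy rule is with a priority queue over the eigenvalue each layer would lose next. The sketch below is an illustration rather than the paper's procedure, and it makes a simplifying assumption: removing one dimension costs the same number of parameters in every layer, so the budget is expressed as a total number of kept dimensions.

```python
import heapq
import numpy as np

def greedy_widths(spectra, total_keep, min_keep=1):
    """Choose per-layer widths summing to total_keep by dropping the globally
    smallest eigenvalues first. Each spectrum must be sorted in descending order."""
    widths = [len(lam) for lam in spectra]
    # Heap entries: (eigenvalue that would be dropped next, layer index).
    heap = [(float(lam[-1]), l) for l, lam in enumerate(spectra)]
    heapq.heapify(heap)
    to_remove = sum(widths) - total_keep
    while to_remove > 0 and heap:
        _, l = heapq.heappop(heap)
        widths[l] -= 1
        to_remove -= 1
        if widths[l] > min_keep:
            heapq.heappush(heap, (float(spectra[l][widths[l] - 1]), l))
    return widths

# Toy spectra (descending): layer 0 decays fast, layer 1 is nearly flat.
spectra = [np.array([10.0, 1.0, 0.1, 0.01]), np.array([3.0, 2.9, 2.8, 2.7])]
widths = greedy_widths(spectra, total_keep=6)
error = sum(lam[k:].sum() for lam, k in zip(spectra, widths))
uniform_error = sum(lam[3:].sum() for lam in spectra)
print(widths, error, uniform_error)
```

In this toy case the greedy allocation removes both dimensions from the fast-decaying layer and leaves the flat layer intact, which gives a much smaller truncation error than a uniform split.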

Experiments and Results

Experimental Setup

  • Models evaluated: LLAMA2-7B, 13B, 70B; OPT-13B, 30B, 66B; Phi-2 (2.7B)
  • Calibration data: 256 × 2048 tokens from C4
  • Evaluation benchmark: EleutherAI LM-Eval-Harness, 7 zero-shot tasks (WinoGrande, HellaSwag, PIQA, ARC-easy, ARC-challenge, OpenBookQA, BoolQ)
  • Baselines: SparseGPT (50% unstructured), Wanda (50% unstructured), LLM-Pruner (20% structured)

Table 1: Zero-Shot Accuracy at 25% Parameter Reduction

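| Model | Dense Acc. | SliceGPT Acc. | Retained | Dense GPUs | Sliced GPUs |
|---|---|---|---|---|---|
| LLAMA2-7B | 64.0% | 58.2% | 91.0% | 1×A100 | 1×A100 |
| LLAMA2-13B | 66.8% | 61.4% | 91.9% | 2×A100 | 1×A100 |
| LLAMA2-70B | 70.4% | 69.8% | 99.1% | 4×A100 | 2×A100 |
| OPT-66B | 66.7% | 66.1% | 99.1% | 4×A100 | 2×A100 |
| Phi-2 (2.7B) | 71.2% | 63.9% | 89.7% | 1×RTX3090 | 1×RTX3090 |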

Key observation: Large models (≥66B) tolerate slicing far better than small models. LLAMA2-70B loses only 0.6 percentage points at 25% compression — within noise on individual tasks. Small models like Phi-2 lose ~7 points, reflecting their lower redundancy.

Table 2: Perplexity on Wikitext-2 (lower is better)

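| Model | Dense | SliceGPT 20% | SliceGPT 25% | SparseGPT 50% |
|---|---|---|---|---|
| LLAMA2-7B | 5.47 | 5.82 | 6.82 | 6.51 |
| LLAMA2-13B | 4.88 | 5.12 | 5.72 | 5.40 |
| LLAMA2-70B | 3.32 | 3.40 | 3.52 | 3.51 |
| OPT-66B | 9.34 | 9.55 | 9.80 | 9.76 |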

At 25% structural reduction, SliceGPT is competitive with SparseGPT at 50% unstructured sparsity on 66–70B models. On smaller models, unstructured sparsity has a slight perplexity edge.

Compute and GPU Reduction

For LLAMA2-70B at $s = 0.25$, each sliced weight matrix keeps one of its dimensions at full size, so the matmul cost per matrix scales as:

$$\text{FLOPs}_\text{sliced} / \text{FLOPs}_\text{dense} \approx k/d = 0.75$$

The paper's measured total compute is 64–66% of dense. That number is lower than the FLOP ratio because it is reported as latency multiplied by GPU count, and the sliced model runs on fewer GPUs. The model also fits on half the GPU count:

  • Dense: 4×A100-40GB required
  • Sliced: 2×A100-40GB sufficient

This is a practical infrastructure saving: half the hardware cost for 99% of the task performance.

Fine-Tuning Recovery

One epoch of LoRA fine-tuning after slicing:

  • LLAMA2-7B: ~2 points recovered (58.2% → 60.1%)
  • LLAMA2-70B: ~0.2 points recovered (already near-dense quality)
  • Phi-2: ~3 points recovered (63.9% → 67.1%)

Fine-tuning is most beneficial where the initial accuracy drop is largest (small models, high sparsity).

Figure 4: Performance vs. GPU Count (LLAMA2-70B)

flowchart LR
    subgraph DenseSetup["Dense LLAMA2-70B"]
        G1["GPU 1\n16.7B params"] & G2["GPU 2\n16.7B params"] & G3["GPU 3\n16.7B params"] & G4["GPU 4\n16.7B params"]
        G1 & G2 & G3 & G4 --> Perf1["70.4% zero-shot\n100% FLOPs"]
    end
    subgraph SlicedSetup["SliceGPT LLAMA2-70B (s=0.25)"]
        SG1["GPU 1\n~26B params"] & SG2["GPU 2\n~26B params"]
        SG1 & SG2 --> Perf2["69.8% zero-shot\n64% FLOPs\n(-0.6 pts, -2 GPUs)"]
    end

Figure 4: SliceGPT halves the GPU count for LLAMA2-70B inference while losing only 0.6 percentage points on zero-shot benchmarks.

Comparison to Prior Work

Table 3: Structural Compression Method Comparison (LLAMA2-70B, Wikitext-2 PPL)

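| Method | Type | Param Reduction | PPL | Custom Kernel | GPU Savings |
|---|---|---|---|---|---|
| Dense | n/a | 0% | 3.32 | No | n/a |
| SparseGPT | Unstructured | ~50% | 3.51 | Yes | No |
| Wanda | Unstructured | ~50% | 3.53 | Yes | No |
| LLM-Pruner | Structural | ~20% | 5.3 | No | Partial |
| SliceGPT | Structural | 25% | 3.52 | No | 4→2 GPUs |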

SliceGPT is the only method in this table that simultaneously achieves: no custom kernels, actual GPU count reduction, and sub-4.0 perplexity at meaningful compression.

Why Computational Invariance is More Principled Than Magnitude-Based Pruning

Most structural pruning methods select neurons/channels to prune by magnitude, gradient, or Taylor expansion — all heuristics. SliceGPT instead:

  1. Applies a theoretically optimal basis change (the PCA rotation is optimal in the sense of minimizing reconstruction error after truncation, by the Eckart–Young theorem)
  2. The subsequent truncation discards directions that are demonstrably low-variance in the calibration distribution
  3. The basis change itself introduces zero approximation error — only the truncation does

This gives a principled upper bound on the error introduced: it is exactly the PCA reconstruction error (discarded eigenvalue sum), which can be computed and used to set the sparsity budget.

Figure 5: Taxonomy of Post-Training LLM Compression

flowchart TD
    root["Post-Training LLM Compression"]
    root --> Quant["Quantization\nGPTQ · AWQ · SmoothQuant · LLM.int8"]
    root --> Unstruct["Unstructured Sparsity\nSparseGPT · Wanda · Magnitude"]
    root --> Struct["Structural Pruning\nSliceGPT · LLM-Pruner · ShortGPT · FLAP"]
    root --> LowRank["Low-Rank Decomposition\nSVD-LLM · ASVD · TrLoRA"]
    Struct -->|"Theoretical basis:\ncomputational invariance + PCA"| SliceGPT_node["SliceGPT\n(this paper)"]

Figure 5: SliceGPT sits in the structured pruning quadrant, uniquely backed by a theoretical invariance argument rather than a magnitude-based heuristic.

Limitations and Boundary Conditions

Scale Dependence is Fundamental

The 99% retention at 25% sparsity holds only for 60B+ parameter models. At 7B:

  • Performance drop: ~6 points
  • Explanation: smaller models have less redundancy (the eigenvalue spectrum of the activation covariance is flatter, meaning more variance is spread across directions rather than concentrated in a few)

This is not a bug but a fundamental property: SliceGPT exploits over-parameterization. Models below ~13B are not sufficiently over-parameterized for 25% slicing to be near-lossless.

Intermediate MLP Dimension Untouched

The basic algorithm only slices the residual stream dimension $d$. The MLP intermediate dimension $d_\text{ff} \approx 8d/3$ (SwiGLU LLAMA2) is preserved. This means:

  • The $W_\text{gate}$ and $W_\text{up}$ matrices change from $d_\text{ff} \times d$ to $d_\text{ff} \times k$: savings proportional to $(d - k)/d = s$
  • The $W_\text{down}$ changes from $d \times d_\text{ff}$ to $k \times d_\text{ff}$: same savings

But the element-wise nonlinearity and the intermediate activations still occupy $d_\text{ff}$ dimensions. For models where the MLP dominates (e.g., MoE models), this limits the FLOP savings.

Benchmark Narrowness

All evaluations use short-answer, zero-shot classification tasks. No results are reported for:

  • Code generation (HumanEval, MBPP)
  • Mathematical reasoning (GSM8K, MATH)
  • Long-form instruction following (MT-Bench, AlpacaEval)
  • Long-context tasks (SCROLLS, LongBench)
  • Multilingual benchmarks

These tasks may be more sensitive to residual stream dimension reduction, particularly long-context tasks where the model must maintain a rich information state across many tokens.

Single-Architecture Evaluation at Large Scale

The 70B-scale experiments cover only LLAMA2 and OPT. Other modern large models — Falcon-180B, Mixtral-8×7B (MoE), GPT-NeoX-20B — are not evaluated. MoE architectures are especially interesting since their routing mechanism interacts with the residual stream in non-trivial ways.

Non-Linear Components Limit Invariance

Computational invariance does not require $Q \phi(x) = \phi(Qx)$, which fails for a general orthogonal $Q$ when $\phi$ is a nonlinear element-wise function. It holds because $\phi$ is applied to the intermediate vector, not to the residual stream: the rotation $Q$ cancels against $Q^T$ inside the input weight before $\phi$ is reached. If $\phi$ were applied to the residual stream directly (as in some architectures), the invariance would break.

For standard transformer attention, the softmax is applied to attention scores $QK^T/\sqrt{d_k}$ (here $Q$ and $K$ are the query and key matrices, not the orthogonal basis), not to the residual stream. The attention scores operate in the per-head space (dimension $d_\text{head}$), which is not sliced. So attention softmax is handled correctly.

Critical Assessment: Weaknesses & Improvements

(a) Weaknesses and Flaws

The “25% parameter reduction” framing is imprecise. SliceGPT reduces the hidden dimension from $d$ to $k = 0.75d$. Weight matrices of shape $d_\text{out} \times d$ become $d_\text{out} \times k$: only one dimension changes, so each sliced matrix, including the $d_\text{ff} \times d$ MLP matrices and the $V \times d$ embedding table, loses $(d - k)/d = s = 25\%$ of its entries. Whether the whole model loses exactly 25% then depends on anything that is not sliced this way, for example change-of-basis matrices kept on the skip connections. The paper reports “up to 25% of model parameters including embeddings”; the “up to” qualifier deserves more prominence, and Table 2 in the paper shows exact per-model figures that vary meaningfully.

Perplexity comparison at different operating points. SliceGPT at 25% structural sparsity is compared against SparseGPT/Wanda at 50% unstructured sparsity. The paper frames this as “competitive,” but these methods are not at the same compute operating point. SparseGPT at 50% unstructured sparsity zeroes half the weights but, without custom sparse kernels, yields no latency benefit, while SliceGPT at 25% slicing removes roughly a quarter of the matmul FLOPs and does deliver a latency benefit on dense kernels. A fair comparison would match on actual measured throughput (tokens/second) at the same hardware budget, not on nominal parameter counts.

Limited ablation on calibration domain. The paper compares C4 and Wikitext-2 and varies the number of calibration sequences, but both datasets are general-purpose natural text and there is no systematic domain study. For practitioners deploying SliceGPT on domain-specific models (medical, legal, code), it is unknown whether calibrating with C4 is adequate or whether domain-matched calibration data is necessary. This is a practical gap.

No latency measurements on realistic inference workloads. The paper reports FLOP counts and mentions running on fewer GPUs, but does not report actual tokens/second at various batch sizes. For memory-bandwidth-bound regimes (small batch sizes), FLOP reduction does not directly translate to latency reduction. The “faster” claim, while plausible, is not fully substantiated.

Limited fine-tuning analysis. The paper briefly mentions 1-epoch LoRA fine-tuning but does not explore: how much recovery is possible with more compute (3–5 epochs), what training data is optimal, or whether full fine-tuning outperforms LoRA for recovery.

(b) Limitations the Authors Understate

KV-cache size is unaffected in the basic algorithm. Since per-head dimension $d_\text{head}$ is not sliced (only the input dimension $d$ of $W_K, W_V$ changes), the K and V vectors output by the projections still have dimension $d_\text{head}$ per head. The KV cache size is therefore unchanged. For long-context inference where KV cache is the primary memory bottleneck, SliceGPT provides no direct benefit. The paper does not acknowledge this.

Tensor-parallel sharding may be complicated. The reduced hidden dimension $k = 0.75d$ may not be evenly divisible by the number of GPUs in tensor-parallel settings. For LLAMA2-70B, $d = 8192$ (easily divisible by 8) and $k = 0.75 \times 8192 = 6144 = 2^{11} \times 3$, which is still divisible by 8 but has fewer factors of 2 than $d$. For slicing ratios that yield a $k$ not divisible by the tensor-parallel degree, padding or irregular sharding is needed.
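A small sketch makes the divisibility concern easy to check for any slicing ratio. The helpers `sliced_dim` and `valid_tp_degrees` are hypothetical names introduced here, and rounding the kept dimension down to a fixed multiple is one possible workaround assumed for illustration, not something taken from the paper.

```python
def sliced_dim(d, sparsity, multiple_of=1):
    """Kept dimension k = (1 - sparsity) * d, rounded down to a multiple."""
    k = int(d * (1 - sparsity))
    return k - (k % multiple_of)

def valid_tp_degrees(k, candidates=(1, 2, 3, 4, 6, 8, 16)):
    """Tensor-parallel degrees that split k into equal shards."""
    return [tp for tp in candidates if k % tp == 0]

d = 8192
k25 = sliced_dim(d, 0.25)                        # 25% slicing
k30 = sliced_dim(d, 0.30)                        # 30% slicing, no rounding
k30_rounded = sliced_dim(d, 0.30, multiple_of=128)

for name, k in [("25%", k25), ("30%", k30), ("30% rounded", k30_rounded)]:
    print(name, k, valid_tp_degrees(k))
```

At 25% the kept dimension shards cleanly, while an unrounded 30% slice only supports a tensor-parallel degree of 1 or 2; rounding down to a multiple of 128 restores the usual power-of-two degrees at the cost of a slightly higher effective sparsity.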

Weight materialization overhead during compression. During compression, both the original weight $W$ and the transformed $W Q_l^T$ must be held in memory simultaneously. LLAMA2-70B has 70B parameters, roughly 140 GB at FP16, so materializing a full second copy would transiently require ~280 GB, which leaves little headroom within the 320 GB of 4×A100-80GB once activations and covariance matrices are added. The paper reports 4×A100-80GB is sufficient but does not detail how this is managed (likely layer-by-layer with careful memory management).

(c) Concrete Improvement Suggestions

1. Slice the MLP intermediate dimension. Apply PCA to the post-activation intermediate activations (the vector after the GeLU/SiLU) and additionally reduce $d_\text{ff}$. This requires two PCA passes per layer (one at the residual stream, one at the MLP intermediate) but would provide proportional FLOP savings across all weight matrices. Expected benefit: at $s = 0.25$ on both $d$ and $d_\text{ff}$, total FLOPs reduce to $\sim(0.75)^2 \approx 56\%$ rather than the current ~64–66%.

2. Non-uniform allocation with validation-loop tuning. Use a small held-out set (16–32 sequences) to measure actual perplexity impact of slicing each layer independently. Protect layers that show large perplexity sensitivity (typically layers near the input and output) and aggressively slice middle layers. Gradient-free black-box optimization (CMA-ES or a greedy scan) over the $\{k_l\}$ schedule could substantially improve the accuracy-compression trade-off without any retraining.

3. Evaluate on reasoning and code benchmarks. Add HumanEval (code generation), GSM8K (math), and MT-Bench (instruction following) to the evaluation suite. If SliceGPT degrades disproportionately on these tasks, which require multi-step precision, this should be transparently reported, with per-task analysis of which tasks are most sensitive to $d$ reduction.

4. Combine with quantization and measure jointly. Apply AWQ or GPTQ after SliceGPT and compare against AWQ/GPTQ alone on the same hardware. If SliceGPT + INT4 achieves better throughput than INT4 alone at similar accuracy, that is a compelling deployment story the paper misses. The combination is natural (slicing reduces the matrix sizes before quantization) but unexplored.

5. Measure KV-cache impact of slicing the head dimension. Extend the algorithm to also apply PCA on the per-head key/value activations (a separate PCA within each attention head) and slice $d_\text{head}$. This would reduce KV cache memory proportionally, which is critical for long-context serving. This is a non-trivial extension but directly addresses the KV-cache limitation identified above.

Deep Dive: SliceGPT vs. Low-Rank Matrix Decomposition

It is easy to conflate SliceGPT with weight-level low-rank decomposition methods (e.g., SVD-LLM, ASVD). Both involve SVD and both produce smaller weight matrices. The difference is conceptual and has practical consequences.

Low-Rank Decomposition of Weights (What SliceGPT is NOT)

Conventional low-rank compression approximates each weight matrix individually:

$$W \approx U_k \Sigma_k V_k^T$$

where $U_k \in \mathbb{R}^{m \times k}$, $V_k^T \in \mathbb{R}^{k \times n}$. This replaces one $m \times n$ matrix with two smaller ones: the computation changes from $y = Wx$ to $y \approx U_k (\Sigma_k V_k^T x)$.

Problems with weight-level SVD:

  • Each matrix is approximated independently, ignoring that the approximation errors of consecutive layers accumulate through the residual stream
  • The approximation is in the weight space: the truncated directions in $W$ may not correspond to directions that the activations actually occupy
  • The two smaller matrices ($U_k$, $V_k^T$) both need to be stored and multiplied; unless $k \ll \min(m,n)$, the inference overhead can actually increase due to two separate GEMM calls
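The last point can be quantified with a short cost model. The sketch below counts multiply-accumulates only and assumes $\Sigma_k$ is folded into $V_k^T$ ahead of time; the function names are hypothetical, and kernel-launch overhead and memory traffic are ignored.

```python
def dense_cost(m, n):
    """Multiply-accumulates for y = W x with W of shape (m, n)."""
    return m * n

def low_rank_cost(m, n, k):
    """Multiply-accumulates for y = U_k (Sigma_k V_k^T x), Sigma_k pre-folded into V_k^T."""
    return k * n + m * k

def break_even_rank(m, n):
    """Largest rank k for which the factored form is strictly cheaper than dense."""
    # k * (m + n) < m * n  <=>  k < m * n / (m + n)
    return (m * n - 1) // (m + n)

print(break_even_rank(4096, 4096))    # square attention-style projection
print(break_even_rank(8192, 28672))   # MLP-style projection
```

For a square matrix the rank has to drop below half the dimension before the factored form saves any arithmetic, whereas slicing a dimension saves compute from the first removed column.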

What SliceGPT Actually Does

SliceGPT applies SVD/PCA to the activations, not the weight matrices. The key mathematical distinction:

Step 1 (SliceGPT): Find $Q_l$ such that the activations $A_l$ have maximum variance in the first $k_l$ coordinates.

Step 2 (SliceGPT): Rotate all weights that touch position $l$ to be consistent with this new basis. This step is exact (computational invariance).

Step 3 (SliceGPT): Truncate the last $d - k_l$ coordinates. This step is the only approximation, and its squared activation reconstruction error equals the sum of the discarded eigenvalues.

The resulting weights are single matrices (not product pairs): a matrix that was $m \times d$ becomes $m \times k$, which is one matrix, not two. This is why inference computation actually decreases rather than just being rearranged.
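The three steps can be sketched on a single linear map $y = Wx$ with NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the activations are synthetic, all variable names are hypothetical, and RMSNorm handling and the residual shortcut matrices are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, k = 512, 16, 8, 12   # samples, hidden size, output size, kept dims

# Synthetic calibration activations with a decaying variance profile (illustrative only).
scales = np.linspace(3.0, 0.05, d)
mixing = np.linalg.qr(rng.standard_normal((d, d)))[0]    # random orthogonal mixing
X = (rng.standard_normal((n, d)) * scales) @ mixing       # rows are activation vectors x
W = rng.standard_normal((m, d))                           # consumes activations: y = W x

# Step 1: PCA basis from the activation second-moment matrix X^T X.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)                # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]                         # re-sort descending
eigvals, Q = eigvals[order], eigvecs[:, order].T          # rows of Q = principal directions

# Step 2: rotate activations (x -> Q x) and weights (W -> W Q^T); exact since Q^T Q = I.
X_rot = X @ Q.T
W_rot = W @ Q.T
exact_gap = np.abs(X @ W.T - X_rot @ W_rot.T).max()

# Step 3: drop the last d - k coordinates; this is the only lossy step.
X_sliced, W_sliced = X_rot[:, :k], W_rot[:, :k]
discarded = np.sum(X_rot[:, k:] ** 2)                     # squared activation energy removed
output_err = np.linalg.norm(X @ W.T - X_sliced @ W_sliced.T)

print("rotation gap:", exact_gap)
print("discarded energy:", discarded, "tail eigenvalue sum:", eigvals[k:].sum())
print("sliced weight shape:", W_sliced.shape, "output error:", output_err)
```

The rotation gap is at floating-point noise level, the discarded activation energy matches the tail eigenvalue sum, and the sliced weight is a single $m \times k$ matrix.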

Formal Comparison of Error Sources

Low-rank SVD of weight $W$:

$$\text{Error} = \|W - U_k \Sigma_k V_k^T\|_F = \sqrt{\sum_{i>k} \sigma_i(W)^2}$$

This error is in weight space and may not reflect what activations actually use.
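The weight-space identity above is easy to confirm numerically. The following sketch uses a random matrix purely for illustration; the shapes and the rank are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
k = 4

# Thin SVD; singular values S are returned in descending order.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W_k = (U[:, :k] * S[:k]) @ Vt[:k, :]           # rank-k truncation U_k Sigma_k V_k^T

frob_error = np.linalg.norm(W - W_k, ord="fro")
tail_norm = np.sqrt(np.sum(S[k:] ** 2))        # sqrt of the discarded squared singular values
print(frob_error, tail_norm)
```

Both numbers agree, and neither depends on any activation data, which is exactly the limitation discussed here.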

SliceGPT truncation at layer $l$:

$$\text{Error}_l = \left\| W_\text{out} \cdot Q_l x_l - W_\text{out} \cdot Q_l^{(k)} x_l \right\| \le C \sum_{i > k_l} \lambda_i^{(l)}$$

where $\lambda_i^{(l)}$ are eigenvalues of the activation covariance. This error is in activation space, so it directly measures how much of the actual runtime information is discarded.

The activation-space error bound is tighter and more meaningful for downstream task performance, because it directly measures how much the model “sees” in the directions being removed.

Figure 6: SliceGPT vs. Weight-Level Low-Rank Decomposition

flowchart TB
    subgraph WeightSVD["Weight-Level SVD (e.g., SVD-LLM)"]
        W1["W ∈ ℝ^{m×d}"] -->|"SVD truncation"| UV["U_k (m×k)\n× Σ_k V_k^T (k×d)\nTwo matrices"]
        UV --> Err1["Error: ‖W − U_kΣ_kV_k^T‖_F\n(in weight space)"]
    end
    subgraph SliceGPT_Diag["SliceGPT (activation-space)"]
        W2["W ∈ ℝ^{m×d}"] -->|"Rotate: W Q_l^T"| WQ["W Q_l^T ∈ ℝ^{m×d}\n(exact, zero error)"]
        WQ -->|"Slice: keep cols 1:k"| Wk["W Q_l^T [:, :k] ∈ ℝ^{m×k}\nOne matrix"]
        Wk --> Err2["Error: Σ_{i>k} λᵢ(activation covariance)\n(in activation space, tighter bound)"]
    end

Figure 6: Weight-level SVD produces a product of two matrices and measures error in weight space. SliceGPT produces a single smaller matrix and measures error in activation space — directly bounding the impact on runtime behavior.

Sparsity Scaling Behavior

How Does Accuracy Degrade as Sparsity Increases?

Understanding the accuracy-sparsity curve is critical for practitioners choosing the operating point. The paper reports results at 20% and 25% sparsity for some models, and at 30%+ for others. The qualitative pattern is:

  • 0–10% sparsity: Nearly zero accuracy loss for all model sizes. The high-variance PCA directions are far more important than the low-variance ones; removing only the tail is almost free.
  • 10–20% sparsity: Negligible loss for 70B+, small loss (~1–2 points) for 7–13B. Still practically useful.
  • 25% sparsity: The “sweet spot” for large models — 99% retention at 70B. For 7B, the 6-point loss becomes noticeable.
  • 30%+ sparsity: Accuracy drops accelerate nonlinearly. The eigenvalue spectrum decays rapidly, but its tail is not negligible; at high sparsity, directions with meaningful variance are being removed.

This nonlinear degradation pattern matches the mathematical prediction: the truncation error grows slowly at first (low eigenvalues discarded) and then quickly (eigenvalues with non-trivial variance begin to be discarded).

Per-Layer Eigenvalue Spectra

Analyzing the eigenvalue spectra of $A_l A_l^T$ at different layers reveals:

  • Early layers (l = 0–10): Relatively flat spectra (activations use many directions roughly equally). These layers are harder to compress and benefit most from non-uniform sparsity (lower $s$).
  • Middle layers (l = 10–60 for 70B): Steep spectra, very high EVR even at $k = 0.5d$. High redundancy.
  • Final layers (l > 60): Moderate spectra. The LM head needs to distinguish many different token predictions, requiring more dimensions.

This layered structure explains why uniform sparsity works well on average but non-uniform allocation (protecting early and late layers) can unlock better accuracy at the same compute budget.
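One way to turn per-layer spectra into a non-uniform schedule is a greedy scan that repeatedly drops the direction carrying the smallest share of its layer's variance. The sketch below is a heuristic under stated assumptions, not the paper's method: `allocate_dims` is a hypothetical helper, the three toy spectra are invented for illustration, and normalized discarded variance is used as a stand-in for true perplexity sensitivity.

```python
import heapq

def allocate_dims(spectra, total_keep):
    """Greedy non-uniform allocation of kept dimensions.

    spectra: per-layer eigenvalue lists, each sorted in descending order.
    total_keep: total number of dimensions to keep, summed over layers.
    Returns the kept dimension k_l for each layer (at least 1 per layer).
    """
    if total_keep < len(spectra):
        raise ValueError("total_keep must allow at least one dimension per layer")
    keep = [len(s) for s in spectra]
    totals = [sum(s) for s in spectra]
    # Min-heap keyed by the variance share of each layer's last kept direction.
    heap = [(s[-1] / totals[l], l) for l, s in enumerate(spectra) if len(s) > 1]
    heapq.heapify(heap)
    for _ in range(sum(keep) - total_keep):
        _, l = heapq.heappop(heap)
        keep[l] -= 1
        if keep[l] > 1:
            heapq.heappush(heap, (spectra[l][keep[l] - 1] / totals[l], l))
    return keep

# Toy spectra: flat (early layer), steep (middle layer), moderate (late layer).
flat = [1.0] * 8
steep = [2.0 ** -i for i in range(8)]
moderate = [1.0 / (i + 1) for i in range(8)]

print(allocate_dims([flat, steep, moderate], total_keep=18))  # -> [8, 4, 6]
```

With a 25% overall budget, the flat early-layer spectrum is left untouched while the steep middle layer absorbs most of the slicing, which mirrors the protect-the-ends pattern described above.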

Interaction with Model Architecture Variants

Tied embeddings: Some models tie the input embedding $E$ and output LM head $W_\text{lm}$. SliceGPT’s treatment of these as separate matrices would break the tie. The codebase handles this by only transforming one of them and re-tying after compression.

Rotary Position Embeddings (RoPE): LLAMA2 uses RoPE for positional encoding. RoPE is applied to Q and K after the projections, operating in the per-head space (dimension $d_\text{head}$). Since SliceGPT does not change $d_\text{head}$, RoPE is unaffected.

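Learned positional embeddings (OPT): OPT adds learned absolute position embeddings to the token embeddings at the input rather than using ALiBi-style attention biases. Because these embeddings live in the residual stream, they must be rotated and sliced together with the token embedding matrix. Additive attention-score biases such as ALiBi, by contrast, act in the per-head space and would be unaffected by residual-stream slicing.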

Reproducibility Notes

  • Code: github.com/microsoft/TransformerCompression (MIT license)
  • Calibration data: HuggingFace allenai/c4 English subset; 256 × 2048 tokens; takes ~30 min to preprocess
  • Compression runtime: ~1–2 hours on 4×A100-80GB for LLAMA2-70B; single-GPU is sufficient for ≤13B models
  • Evaluation: EleutherAI LM-Eval-Harness v0.3+; 7-task zero-shot average
  • Determinism: Fully deterministic given fixed calibration sequence order; no randomness after calibration sampling
  • Dependencies: PyTorch ≥ 2.0, transformers, datasets, scipy.linalg.eigh (for covariance eigendecomposition)
  • Memory for compression: Requires holding full-precision weights plus one layer’s covariance matrix at a time; ~160GB peak for LLAMA2-70B

The algorithm is straightforward: ~200 lines of PyTorch to implement from scratch, making SliceGPT one of the most accessible papers in post-training compression for pedagogical purposes.

Summary: Design Decisions at a Glance

Before concluding, here is a quick reference capturing SliceGPT’s key design choices and their implications:

| Decision | Rationale | Open Gap |
| --- | --- | --- |
| PCA basis (not random Q) | Minimizes activation reconstruction error (Eckart-Young optimal) | Requires calibration forward pass |
| Uniform sparsity by default | Simple; near-optimal for large models | Suboptimal for small models; non-uniform is better |
| Absorb RMSNorm into weights | Exact simplification; no extra ops at inference | Only works for diagonal-scale norms |
| Preserve $d_\text{ff}$ | Avoids second PCA pass | Leaves MLP FLOP savings on the table |
| 256-sequence calibration | Sufficient for stable PCA; low overhead | May be domain-sensitive for specialized models |
| Optional fine-tuning | Avoids training setup for large models | Small models benefit significantly from even 1 epoch |

Each “Open Gap” row is a concrete future research direction. Together they sketch a roadmap for extending SliceGPT to higher compression ratios and broader deployment scenarios.

Conclusion

SliceGPT makes a clean theoretical contribution — the computational invariance theorem — and translates it directly into an engineering outcome: smaller, faster, hardware-agnostic transformer inference. The insight that an orthogonal basis change is transparent to the computation, and that PCA identifies the optimal basis for subsequent truncation, is both elegant and practically powerful.

At 70B scale, the method delivers compelling results: 25% compression with 99% task performance, halved GPU count, and 34–36% FLOP reduction — all without custom kernels. For practitioners deploying LLAMA2-70B or similar models, SliceGPT represents one of the most deployment-friendly compression options available.

The method’s limitations are equally clear: it is most effective for large (≥30B) models, has been validated primarily on short zero-shot classification tasks, leaves the MLP intermediate dimension untouched, and does not address the KV cache. These are not disqualifying limitations but they define the boundary conditions for when SliceGPT is the right tool.

For researchers building on this work, the most impactful next steps are: MLP intermediate slicing, per-head KV-cache reduction via head-dimension PCA, extended evaluation on reasoning and code tasks, and composability with quantization. The computational invariance theorem itself is a result worth studying independently — it may underpin future compression methods for other neural architectures beyond transformers.

Personal take: SliceGPT is one of the most pedagogically clean papers in post-training compression. The core insight fits in five lines of algebra, the code is minimal and well-commented, and the 70B results are genuinely impressive. Reading the computational invariance proof is time well spent for anyone working with transformer internals.

For deeper context:

  • SparseGPT (Frantar & Alistarh, ICML 2023): unstructured counterpart; compare their layer-wise reconstruction against SliceGPT’s PCA calibration
  • SVD-LLM (Wang et al., 2024): weight-level SVD for LLMs; contrasts with SliceGPT’s activation-space philosophy
  • ASVD (Yuan et al., 2023): activation-aware SVD, thematically closest to SliceGPT but at the weight level
  • QuIP# (Tseng et al., ICML 2024): uses random orthogonal incoherence transforms before quantization; the orthogonal-transform idea is mathematically related to SliceGPT’s basis change, applied to enable better quantization rather than slicing
  • LLM-Pruner (Ma et al., 2023): gradient-guided structural pruning; shows how heuristic-based methods compare in accuracy-compression trade-off
  • TransformerCompression (GitHub): official open-source implementation by Microsoft Research; clean, well-documented, actively maintained

Conceptual Reading Path

For a reader new to post-training compression, the recommended reading order is:

  1. This paper (SliceGPT) — start here to understand the theoretical framework
  2. SparseGPT — see how unstructured methods handle the same calibration-based problem differently
  3. SVD-LLM / ASVD — compare weight-space vs. activation-space SVD approaches directly
  4. QuIP# — see how the same orthogonal-transform idea is applied in a quantization context
  5. LLM-Pruner — understand gradient-guided structural pruning to appreciate SliceGPT’s calibration-only simplicity

This path builds a coherent mental model: the common thread across all these methods is using calibration data to guide compression decisions, with each method differing in what is compressed (weights, activations, structure) and how the calibration signal is used (Hessian, PCA, gradient).