DoRA: Weight-Decomposed Low-Rank Adaptation — Technical Review
Review date: 2026-05-22
Reviewer: Zhongzhu Zhou
Paper: DoRA: Weight-Decomposed Low-Rank Adaptation
Authors: Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Min-Hung Chen (NVIDIA & HKUST)
arXiv: 2402.09353v6, 2024-07-09
Venue: ICML 2024 (Oral), PMLR 235
Short answer
LoRA is the workhorse of parameter-efficient fine-tuning — cheap, fast, and practical. But it consistently trails full fine-tuning (FT) in accuracy. The standard explanation has been “LoRA just doesn’t have enough trainable parameters.” DoRA challenges that story with hard evidence: the problem is not parameter count, it’s the structure of the update.
The key insight: full fine-tuning and LoRA update weights in qualitatively different ways. FT tends to make either large magnitude changes or large directional changes, but not both in proportion. LoRA, by contrast, couples them: a larger directional change comes with a larger magnitude change, and a smaller one with a smaller. It lacks the fine-grained control to change one strongly while holding the other nearly fixed.
DoRA fixes this by borrowing from weight normalization (Salimans & Kingma, 2016): decompose any weight matrix $W$ into a magnitude vector $\mathbf{m}$ and a direction matrix $V$, then treat them as separate trainable quantities. Because the direction is high-dimensional, LoRA is applied there for efficiency. The magnitude, just one scalar per column, is trained directly. The merged weight at inference is a plain dense matrix of the original shape, so there’s zero inference overhead.
In concrete numbers: on LLaMA-3-8B commonsense reasoning, DoRA surpasses LoRA by +4.4 points while using virtually the same parameter budget (0.71% vs 0.70%). DoRA† (half the rank of LoRA) beats LoRA by +4.2 points with half the trainable parameters. On LLaVA-1.5-7B visual instruction tuning, DoRA improves the average by +0.7 points over LoRA and +1.1 over full FT. The average improvement holds for every model and rank setting reported below, although individual benchmarks occasionally favor LoRA (VQAv2 in §3.3, for example).
The paper is also a good case study in analysis-driven design: the authors first built a diagnostic tool (weight decomposition analysis), found a structural difference between FT and LoRA, and then directly designed DoRA to close that gap. The resulting method is conceptually tight and the empirical improvements are unusually reproducible.
1. Prerequisites
This section is for readers who have worked with transformers but haven’t studied the theory of weight normalization, LoRA internals, or parameter-efficient fine-tuning design space. Skip §1.1–1.3 if you’ve read the LoRA and AdaLoRA papers; skip §1.4–1.5 if you’ve read the weight normalization paper.
1.1 Full fine-tuning and its cost
Given a pretrained model with parameter vector $\theta_0$, full fine-tuning (FT) finds:
\theta^{*} = \arg\min_{\theta} \mathcal{L}(\theta; \mathcal{D})
with $\theta$ initialized at the pretrained weights $\theta_0$, where $\mathcal{L}$ is the task loss on dataset $\mathcal{D}$. For a 7-billion-parameter LLM, this means storing and updating 7B float32 parameters (28 GB) every step, plus optimizer states (Adam stores first and second moments: another 56 GB), plus activations for backprop. In practice, FT requires 80–160 GB of GPU memory for a 7B model, which rules it out for most practitioners.
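The memory figures above follow from simple arithmetic. A minimal sketch, assuming fp32 everywhere, decimal gigabytes, and ignoring activations (the function name and the dictionary keys are illustrative, not from the paper):

```python
def full_ft_memory_gb(n_params, bytes_per_param=4):
    """Rough fp32 memory budget for full fine-tuning with Adam (activations excluded)."""
    gb = 1e9  # decimal gigabytes, matching the 28 GB figure in the text
    weights = n_params * bytes_per_param / gb
    grads = n_params * bytes_per_param / gb            # one gradient per weight
    adam_states = 2 * n_params * bytes_per_param / gb  # first and second moments
    return {
        "weights_gb": weights,
        "grads_gb": grads,
        "adam_states_gb": adam_states,
        "total_gb": weights + grads + adam_states,
    }


budget = full_ft_memory_gb(7e9)
print(budget)  # weights 28 GB, gradients 28 GB, Adam states 56 GB
```

Gradients add another 28 GB on top of the weights and optimizer states quoted in the text, which is one reason the practical range starts well above 84 GB once activations are included.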
1.2 The PEFT design space
Parameter-efficient fine-tuning (PEFT) methods reduce trainable parameters by orders of magnitude. The design space has three broad families:
Adapter-based: Insert small bottleneck modules (typically linear → nonlinear → linear with a narrow middle dimension) at specific points in the transformer (after self-attention, after FFN). Only the adapter weights are trained. Sequential adapters add latency because they cannot be merged; parallel adapters can sometimes be fused.
Prompt-based / prefix-tuning: Prepend trainable “soft tokens” to the input sequence or to each layer’s key-value cache (prefix tuning). The backbone is frozen; only the soft tokens are optimized. These are sensitive to initialization and usually underperform adapters.
Low-rank update (LoRA family): Model the weight update $\Delta W$ as a low-rank product $BA$ rather than a full-rank matrix. After training, $\Delta W$ is merged into $W_0$ with zero inference overhead. This is the dominant paradigm and the focus of DoRA.
1.3 LoRA: mathematical formulation
For a pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA (Hu et al., ICLR 2022) restricts the weight update to be low-rank:
W' = W_0 + \Delta W = W_0 + BA \tag{1}
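where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.
Initialization strategy: $A$ is initialized randomly (the LoRA paper describes a random Gaussian; the DoRA paper and common implementations use Kaiming uniform); $B$ is initialized to zero. This ensures $\Delta W = BA = 0$ at the start of training, so the model starts from exactly the pretrained weights.
Parameter count: Instead of $dk$ parameters, LoRA uses $r(d + k)$ parameters. For a square projection ($d = k$) the reduction factor is $d / (2r)$; for example, $d = k = 4096$ and $r = 16$ gives about 16.8M versus 131K parameters, a 128× reduction.
Inference merge: At deployment, compute $W' = W_0 + BA$ once and store the merged dense matrix. The forward pass is identical to the original model with no overhead.
Scaling: In practice, LoRA applies a scaling factor $\alpha / r$ to $BA$, where $\alpha$ is a hyperparameter (often set to the same value as $r$). This rescales the effective learning rate and reduces the need to retune hyperparameters when the rank changes.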
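The parameter-count and scaling arithmetic can be checked in a few lines. This is a small sketch; the layer size ($d = k = 4096$), the rank ($r = 16$), and the helper names are illustrative assumptions rather than values fixed by the paper:

```python
def lora_param_counts(d, k, r):
    """Return (full, lora, reduction) parameter counts for one d x k layer."""
    full = d * k          # dense update: one parameter per weight entry
    lora = r * (d + k)    # B is d x r, A is r x k
    return full, lora, full / lora


def lora_scale(alpha, r):
    """Scaling factor applied to BA in the LoRA forward pass."""
    return alpha / r


full, lora, ratio = lora_param_counts(d=4096, k=4096, r=16)
print(full, lora, ratio)         # 16777216 131072 128.0
print(lora_scale(alpha=16, r=16))  # 1.0 when alpha equals r
```

With $\alpha = r$ the scale is 1, so the update is applied unscaled; doubling $\alpha$ at fixed $r$ doubles the effective step on $BA$.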
1.4 Weight normalization and the magnitude-direction decomposition
Weight normalization (Salimans & Kingma, NeurIPS 2016) reparameterizes a weight vector $\mathbf{w}$ as:
\mathbf{w} = g \cdot \frac{\mathbf{v}}{\|\mathbf{v}\|} \tag{2}
where $g$ is a scalar magnitude and $\mathbf{v}$ is a direction vector. The motivation is conditioning: if the gradient covariance is better aligned with the identity matrix, SGD converges faster. The key property is that $g$ and $\mathbf{v}$ decouple magnitude from direction, so the optimizer can adjust them independently at different rates.
For a matrix $W \in \mathbb{R}^{d \times k}$ with $k$ columns (each column is a weight vector), the column-wise generalization is:
W = \mathbf{m} \cdot \frac{V}{\|V\|_c} \tag{3}
where $\mathbf{m} \in \mathbb{R}^{1 \times k}$ is the row vector of column norms, $V \in \mathbb{R}^{d \times k}$ is the direction matrix, and $\|\cdot\|_c$ denotes the column-wise norm (the $\ell_2$ norm of each column, so dividing by it rescales each column of $V$ to unit length). After this, every column of $V / \|V\|_c$ is a unit vector.
The difference from weight normalization is the initialization: weight normalization trains from random initialization (and is sensitive to it), whereas DoRA initializes from the pretrained weights ($\mathbf{m} = \|W_0\|_c$ and $V = W_0$ at the start), which sidesteps initialization sensitivity.
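To make the column-wise decomposition of Eq. (3) concrete, here is a minimal NumPy sketch. The matrix is random and the function name `decompose_columns` is hypothetical; the point is only that magnitude times unit direction reconstructs the original matrix:

```python
import numpy as np


def decompose_columns(W):
    """Split W (d x k) into column magnitudes m (1 x k) and unit-norm directions."""
    m = np.linalg.norm(W, axis=0, keepdims=True)  # shape (1, k)
    direction = W / m                             # each column has unit L2 norm
    return m, direction


rng = np.random.default_rng(0)
W0 = rng.normal(size=(6, 4))      # toy stand-in for a pretrained weight
m, direction = decompose_columns(W0)

print(m.shape)                                   # (1, 4)
print(np.linalg.norm(direction, axis=0))         # all ones, up to rounding
print(np.allclose(m * direction, W0))            # True
```

Broadcasting the `(1, k)` magnitude row against the `(d, k)` direction matrix is what the $\mathbf{m} \cdot V / \|V\|_c$ notation stands for.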
1.5 What “learning pattern” means in this context
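DoRA introduces a weight decomposition analysis (Section 3 of the paper). Given a fine-tuned weight $W^t$ at training step $t$ and the pretrained weight $W_0$, decompose both:
W^t = \mathbf{m}^t \cdot \frac{V^t}{\|V^t\|_c}, \qquad W_0 = \mathbf{m}^0 \cdot \frac{W_0}{\|W_0\|_c}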
Then define the magnitude difference:
\Delta M^t = \frac{1}{k} \sum_{n=1}^{k} |m_n^t - m_n^0| \tag{4}
and the directional difference:
\Delta D^t = \frac{1}{k} \sum_{n=1}^{k} \left(1 - \cos(V_n^t, W_0^n)\right) \tag{5}
where $\cos(\cdot, \cdot)$ is cosine similarity and $V_n^t$, $W_0^n$ are the $n$-th columns of $V^t$ and $W_0$.
By plotting $(\Delta D^t, \Delta M^t)$ scatter plots across layers and training steps, the authors reveal that:
- Full FT: Points scatter with a negative slope (large direction change correlates with small magnitude change, and vice versa).
- LoRA: Points scatter with a positive slope (direction and magnitude always increase/decrease together).
- DoRA: Points scatter with a negative slope similar to FT.
The Pearson correlation between $\Delta D$ and $\Delta M$ is $-0.62$ for FT, $+0.83$ for LoRA, and $-0.31$ for DoRA, confirming that DoRA’s learning pattern is qualitatively more similar to FT than LoRA’s is.
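The sign of the correlation is the whole diagnostic, so it helps to see how it is computed from a set of $(\Delta D, \Delta M)$ points. The sketch below uses made-up numbers chosen only to produce one positively and one negatively sloped pattern; they are not values from the paper, and `pearson` is a plain textbook implementation:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up (delta_D, delta_M) pairs, one per layer/checkpoint. NOT paper data.
delta_d = [0.01, 0.02, 0.03, 0.04, 0.05]
coupled_m = [0.10, 0.19, 0.31, 0.40, 0.52]    # grows with delta_D (LoRA-like)
decoupled_m = [0.50, 0.41, 0.33, 0.18, 0.12]  # shrinks as delta_D grows (FT-like)

r_coupled = pearson(delta_d, coupled_m)
r_decoupled = pearson(delta_d, decoupled_m)
print(r_coupled > 0, r_decoupled < 0)  # True True
```

In a real analysis each point would come from one layer at one checkpoint, using the $\Delta M^t$ and $\Delta D^t$ definitions in Eqs. (4) and (5).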
2. Method
2.1 The core problem with LoRA’s coupled updates
Figure 1: FT vs LoRA vs DoRA — learning patterns
graph TD
subgraph FT["Full Fine-Tuning (FT)"]
F1["Large ΔD → Small ΔM (or vice versa)"]
F2["Negative slope in scatter plot"]
F3["Correlation(ΔD, ΔM) = −0.62"]
end
subgraph LR["LoRA"]
L1["ΔD and ΔM always proportional"]
L2["Positive slope in scatter plot"]
L3["Correlation(ΔD, ΔM) = +0.83"]
L4["Cannot make subtle directional change\nwithout also changing magnitude"]
end
subgraph DR["DoRA"]
D1["Large ΔD → Small ΔM (or vice versa)"]
D2["Negative slope (like FT)"]
D3["Correlation(ΔD, ΔM) = −0.31"]
D4["Decoupled by design"]
end
FT -- "DoRA mimics" --> DR
LR -- "DoRA improves" --> DR
Why does positive slope hurt? When LoRA wants to make a strong directional change (move the weight vector to point in a new direction), its positive coupling forces the magnitude to increase simultaneously. Conversely, when only a small directional update is made, the magnitude change is correspondingly small, even if a large rescaling is what the task needs. This rigid coupling forces LoRA into a suboptimal learning trajectory: it can’t move freely in the $(\Delta D, \Delta M)$ plane the way FT does.
2.2 The DoRA formulation
Drawing on the weight decomposition from Eq. (3), DoRA decomposes the pretrained weight $W_0$ into magnitude and direction, then fine-tunes both:
W' = \mathbf{m} \cdot \frac{V + \Delta V}{\|V + \Delta V\|_c} = \mathbf{m} \cdot \frac{W_0 + BA}{\|W_0 + BA\|_c} \tag{6}
What’s trained:
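- $\mathbf{m} \in \mathbb{R}^{1 \times k}$: the magnitude vector, trained directly (one scalar per column, tiny parameter count)
- $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$: the LoRA matrices for the directional update
What’s frozen: $V = W_0$ (the original weight, used as the frozen base for direction)
Initialization: At the start of training, $B = 0$ (so $\Delta V = BA = 0$), meaning $V + \Delta V = W_0$. The magnitude is $\mathbf{m} = \|W_0\|_c$. This gives $W' = \|W_0\|_c \cdot W_0 / \|W_0\|_c = W_0$, so DoRA starts exactly at the pretrained weights, same as LoRA.
Inference merge: After training, $W'$ is a dense matrix of the same shape as $W_0$. It can be pre-computed and stored, with zero inference overhead.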
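The initialization and merge properties of Eq. (6) can be verified numerically. The following is a minimal NumPy sketch with toy sizes and random stand-ins for the pretrained and trained parameters; the helper names are hypothetical and this is not the official implementation:

```python
import numpy as np


def column_norms(M):
    return np.linalg.norm(M, axis=0, keepdims=True)  # shape (1, k)


def dora_weight(W0, m, B, A):
    """Eq. (6): W' = m * (W0 + BA) / ||W0 + BA||_c."""
    V_prime = W0 + B @ A
    return m * V_prime / column_norms(V_prime)


rng = np.random.default_rng(0)
d, k, r = 8, 5, 2                      # toy sizes chosen for illustration
W0 = rng.normal(size=(d, k))           # stands in for a pretrained weight
m = column_norms(W0)                   # magnitude init: ||W0||_c
A = rng.normal(size=(r, k))            # random init
B = np.zeros((d, r))                   # zero init, so BA = 0 at the start

W_init = dora_weight(W0, m, B, A)      # reproduces W0 at initialization

# Pretend training moved the parameters, then merge once for deployment.
B_trained = rng.normal(scale=0.1, size=(d, r))
m_trained = m * 1.1
W_merged = dora_weight(W0, m_trained, B_trained, A)

x = rng.normal(size=(k,))
y = W_merged @ x                       # plain dense matmul at inference
print(np.allclose(W_init, W0), W_merged.shape == W0.shape)  # True True
```

After the merge, the column norms of `W_merged` equal `m_trained`, which is a quick sanity check that the magnitude vector alone controls the scale of each column.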
2.3 The algorithm, step by step
Algorithm 1: DoRA Fine-Tuning
Input: Pretrained weight W₀ ∈ ℝ^{d×k}, rank r, target task dataset D
Output: Fine-tuned merged weight W' ∈ ℝ^{d×k}
---
Initialization:
1. Compute m ← column-wise ℓ₂ norm of W₀ # m ∈ ℝ^{1×k}
2. Set V ← W₀ # frozen direction base
3. Initialize A ~ Kaiming_uniform(r, k) # LoRA A matrix
4. Initialize B ← 0_{d×r} # LoRA B matrix (zero init)
5. Mark as trainable: {m, A, B}
6. Mark as frozen: {V (= W₀)}
Forward pass (each training step):
7. Compute ΔV ← B @ A # low-rank directional delta
8. Compute V' ← V + ΔV # updated direction (unnorm.)
9. Compute norms ← column_norms(V') # ℝ^{1×k}, treated as CONSTANT
# (detach from grad graph)
10. Compute W' ← m * (V' / norms) # ∈ ℝ^{d×k}
11. Compute output ← W' @ x
Backward pass:
12. Compute ∂L/∂W' via autograd
13. Gradient w.r.t. m: ∂L/∂m = (∂L/∂W') · V' / norms
= ||∇_{W'} L|| · cos(∇_{W'} L, v') [Eq. 8]
14. Gradient w.r.t. V': ∂L/∂V' = (m / norms) · ∂L/∂W' [Eq. 7]
(propagated to A and B through ΔV = BA)
15. Update {m, A, B} with optimizer step
Post-training merge (once, before deployment):
16. Compute W' ← m * (W₀ + B@A) / column_norms(W₀ + B@A)
17. Store W' as the deployed weight (dense, same shape as W₀)
18. Discard m, A, B
Key implementation note (line 9): The column norms are computed dynamically each step (so they track the evolving $V' = W_0 + BA$), but they are detached from the gradient graph. This means $\nabla_{V'} \mathcal{L}$ is computed as if the norms were constant, i.e., $\nabla_{V'} \mathcal{L} = \frac{\mathbf{m}}{C} \cdot \nabla_{W'} \mathcal{L}$ where $C = \|V'\|_c$. This eliminates a significant memory overhead in backprop (saves ~24% GPU memory on LLaMA-7B) with negligible accuracy loss (about 0.2 points on commonsense reasoning, per §6.3).
2.4 Gradient analysis: why decomposition stabilizes LoRA
This is the most mathematically interesting part of the paper. Let’s derive the full gradient equations.
Starting from DoRA’s forward pass (treating $C = \|V'\|_c$ as constant per the optimization from §2.3):
W' = \mathbf{m} \cdot \frac{V'}{C}, \qquad V' = W_0 + BA
Gradient of loss w.r.t. $V'$ (and thus w.r.t. $\Delta V = BA$):
\nabla_{V'} \mathcal{L} = \frac{\mathbf{m}}{C} \cdot \nabla_{W'} \mathcal{L} \tag{7}
This is a pure rescaling of the weight gradient: within each column the direction is the same, but the magnitude is modulated by $m_n / C_n$. Notice what this does:
- Columns of $V'$ with large magnitude relative to their norm ($m_n / C_n$ large) receive larger gradients.
- This mimics gradient preconditioning: the update to the direction is scaled by how large the column’s learned magnitude is relative to its current norm.
Gradient of loss w.r.t. $\mathbf{m}$ (column-wise):
\nabla_{m_n} \mathcal{L} = \frac{\nabla_{W'} \mathcal{L}_n \cdot V'_n}{\|V'_n\|} = \|\nabla_{W'} \mathcal{L}_n\| \cdot \cos(\nabla_{W'}\mathcal{L}_n, V'_n) \tag{8}
Key insight from Eq. (8): Here $\nabla_{W'} \mathcal{L}_n$ is the $n$-th column of the weight gradient. The gradient for the magnitude scalar depends on the cosine alignment between the loss gradient and the current direction vector. When the loss gradient is nearly perpendicular to the current direction (small cosine → the directional update should be large, not the magnitude), $\nabla_{m_n} \mathcal{L}$ is small, so the magnitude barely changes while the direction updates. Conversely, when the gradient aligns well with the current direction (large cosine → the weight mostly needs to scale up/down without rotating), the magnitude update is large. This is consistent with the negative correlation between $\Delta D$ and $\Delta M$ observed empirically.
In other words, Eq. (8) mathematically explains why DoRA exhibits FT-like learning patterns: the gradient geometry automatically decouples direction updates from magnitude updates.
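A two-column toy case makes the routing in Eq. (8) visible. In this sketch `G` stands in for $\nabla_{W'} \mathcal{L}$ and the numbers are hand-picked for illustration: one column receives a gradient parallel to its direction, the other a gradient perpendicular to it.

```python
import numpy as np


def magnitude_grad(G, V_prime):
    """Eq. (8), column-wise: dL/dm_n = <G_n, V'_n> / ||V'_n||."""
    norms = np.linalg.norm(V_prime, axis=0)
    return np.sum(G * V_prime, axis=0) / norms


def cosine_form(G, V_prime):
    """The same quantity written as ||G_n|| * cos(G_n, V'_n)."""
    g_norm = np.linalg.norm(G, axis=0)
    v_norm = np.linalg.norm(V_prime, axis=0)
    cos = np.sum(G * V_prime, axis=0) / (g_norm * v_norm)
    return g_norm * cos


# Two columns, both currently pointing along the first axis.
V_prime = np.array([[2.0, 2.0],
                    [0.0, 0.0]])
# Column 0: gradient parallel to the direction. Column 1: perpendicular to it.
G = np.array([[3.0, 0.0],
              [0.0, 3.0]])

grad_m = magnitude_grad(G, V_prime)
print(grad_m)  # [3. 0.]
```

Both gradient columns have the same norm (3), yet only the aligned one produces a magnitude gradient; the perpendicular one leaves $m_n$ untouched and acts purely on the direction.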
Figure 2: Gradient flow in DoRA vs LoRA
graph LR
subgraph LoRA_grad["LoRA Backward Pass"]
lg1["∂L/∂W' ∈ ℝ^{d×k}"]
lg2["∂L/∂B = (∂L/∂W') Aᵀ"]
lg3["∂L/∂A = Bᵀ (∂L/∂W')"]
lg1 --> lg2
lg1 --> lg3
lg4["ΔM and ΔD always coupled\n(via BA product)"]
lg2 --> lg4
lg3 --> lg4
end
subgraph DoRA_grad["DoRA Backward Pass"]
dg1["∂L/∂W' ∈ ℝ^{d×k}"]
dg2["∂L/∂m = (∂L/∂W')·V'/‖V'‖\n= ‖grad‖·cos(grad, v')"]
dg3["∂L/∂V' = (m/C)·∂L/∂W'\n→ propagates to A, B"]
dg1 --> dg2
dg1 --> dg3
dg4["cos(grad, v') large → big Δm, small ΔD\ncos(grad, v') small → small Δm, big ΔD\n(negative correlation, like FT)"]
dg2 --> dg4
dg3 --> dg4
end
2.5 The subtle “decoupling” argument, more carefully
One might wonder: why does decoupling magnitude from direction help, specifically? Here is the precise argument from the paper.
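Consider two hypothetical update scenarios $S_1$ and $S_2$ with equal gradient norms: $\|\nabla_{W'} \mathcal{L}_{S_1}\| = \|\nabla_{W'} \mathcal{L}_{S_2}\|$. In $S_1$, the update is mostly along the current weight direction (large $|\cos|$ ≈ large magnitude change, small directional change). In $S_2$, the update is mostly perpendicular to the current direction (small $|\cos|$ ≈ small magnitude change, large directional change).
For LoRA:
- $\partial \mathcal{L} / \partial B = (\partial \mathcal{L} / \partial W') A^{\top}$, $\partial \mathcal{L} / \partial A = B^{\top} (\partial \mathcal{L} / \partial W')$
- These gradients update both magnitude and direction implicitly through $\Delta W = BA$. There’s no mechanism to sense whether the current step should prioritize magnitude or direction.
For DoRA:
- In scenario $S_1$: large $|\cos|$ → large $|\nabla_{\mathbf{m}} \mathcal{L}|$ → large magnitude update, small direction update (because $\nabla_{W'} \mathcal{L}$ is mostly in the direction already captured by $V'$).
- In scenario $S_2$: small $|\cos|$ → small $|\nabla_{\mathbf{m}} \mathcal{L}|$ → small magnitude update, large direction update.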
This auto-routing of gradient energy between magnitude and direction is the core efficiency gain.
2.6 DVoRA: DoRA + VeRA
DoRA is modular: the low-rank component can be replaced by any LoRA variant. The paper demonstrates this with VeRA (Kopiczko et al., ICLR 2024).
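VeRA (Vector-based Random Matrix Adaptation): freeze a single shared pair of random matrices $A$, $B$ across all layers; use only layer-specific scaling vectors $\mathbf{b}$ and $\mathbf{d}$ as trainable parameters:
W' = W_0 + \Lambda_b B \Lambda_d A
where $\Lambda_b = \mathrm{diag}(\mathbf{b})$ and $\Lambda_d = \mathrm{diag}(\mathbf{d})$. VeRA achieves 10× fewer trainable parameters than LoRA at the cost of some accuracy. DVoRA plugs VeRA in as the directional update in DoRA:
W' = \mathbf{m} \cdot \frac{W_0 + \Lambda_b B \Lambda_d A}{\|W_0 + \Lambda_b B \Lambda_d A\|_c}
On MT-Bench with LLaMA2-7B, DVoRA achieves score 6.0, matching DoRA and surpassing both VeRA (5.5) and LoRA (5.7), with only 0.04% trainable parameters vs LoRA’s 2.31%. This is roughly a 58× parameter reduction relative to LoRA, with a higher score.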
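A rough sketch of how DVoRA composes the two ideas, assuming the VeRA update has the form $\mathrm{diag}(\mathbf{b})\, B\, \mathrm{diag}(\mathbf{d})\, A$ with frozen random $B$ and $A$. The toy sizes, the 0.1 initial value for the inner scaling vector, and all function names are illustrative assumptions, not taken from the paper:

```python
import numpy as np


def column_norms(M):
    return np.linalg.norm(M, axis=0, keepdims=True)


def vera_delta(b, B, d_vec, A):
    """VeRA-style update: diag(b) @ B @ diag(d_vec) @ A, with B and A frozen."""
    return (b[:, None] * B) @ (d_vec[:, None] * A)


def dvora_weight(W0, m, b, B, d_vec, A):
    """DoRA normalization wrapped around a VeRA-style directional update."""
    V_prime = W0 + vera_delta(b, B, d_vec, A)
    return m * V_prime / column_norms(V_prime)


rng = np.random.default_rng(0)
d, k, r = 8, 5, 4
W0 = rng.normal(size=(d, k))
B = rng.normal(size=(d, r))     # frozen random matrix
A = rng.normal(size=(r, k))     # frozen random matrix
b = np.zeros(d)                 # trainable; zero init keeps the update at zero
d_vec = np.full(r, 0.1)         # trainable; 0.1 is an illustrative choice
m = column_norms(W0)            # trainable magnitude, as in DoRA

W_init = dvora_weight(W0, m, b, B, d_vec, A)
trainable = b.size + d_vec.size + m.size   # d + r + k
lora_trainable = r * (d + k)               # LoRA at the same rank, for comparison
print(trainable, lora_trainable)           # 17 52
```

The trainable count grows as $d + r + k$ rather than $r(d + k)$, which is where the large parameter savings come from; the gap widens quickly at realistic layer sizes.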
2.7 Architecture overview diagram
Figure 3: DoRA System Architecture
graph TD
subgraph initialization["Initialization (once, from W₀)"]
W0["W₀ ∈ ℝ^{d×k}\n(pretrained, frozen)"]
decompose["Decompose into\nm = ‖W₀‖_c (trainable)\nV = W₀ (frozen direction base)"]
W0 --> decompose
end
subgraph lora_branch["LoRA Branch (trainable)"]
A["A ∈ ℝ^{r×k}\n(Kaiming init)"]
B["B ∈ ℝ^{d×r}\n(zero init)"]
delta["ΔV = B@A\n∈ ℝ^{d×k}"]
A --> delta
B --> delta
end
subgraph forward["Forward Pass"]
add["V' = W₀ + ΔV"]
norm["C = ‖V'‖_c\n(detached from grad)"]
mag["m ∈ ℝ^{1×k}\n(trainable scalar per column)"]
out["W' = m · (V'/C)\n∈ ℝ^{d×k}"]
delta --> add
W0 --> add
add --> norm
add --> out
norm --> out
mag --> out
end
subgraph merge["Merge (inference, once)"]
merged["W'_merged = m · (W₀ + B@A) / ‖W₀+B@A‖_c\nStore as dense matrix, discard {m,A,B}"]
out --> merged
end
decompose --> lora_branch
decompose --> forward
3. Experiments
3.1 Commonsense reasoning (LLaMA family)
Setup: Eight commonsense reasoning benchmarks: BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, OBQA. Training data: the 8 task training sets combined (following the LLM-Adapters protocol from Hu et al., 2023). Models: LLaMA-7B, LLaMA-13B, LLaMA2-7B, LLaMA3-8B. Evaluated baselines: Prefix, Series adapter, Parallel adapter, LoRA, ChatGPT (zero-shot CoT).
Figure 4: Commonsense Reasoning Results Summary
| Model | Method | #Params (%) | Avg. Accuracy |
|---|---|---|---|
| — | ChatGPT (zero-shot CoT) | — | 77.0 |
| LLaMA-7B | LoRA | 0.83 | 74.7 |
| LLaMA-7B | DoRA† (r/2) | 0.43 | 77.5 |
| LLaMA-7B | DoRA | 0.84 | 78.4 (+3.7) |
| LLaMA-13B | LoRA | 0.67 | 80.5 |
| LLaMA-13B | DoRA | 0.68 | 81.5 (+1.0) |
| LLaMA2-7B | LoRA | 0.83 | 77.6 |
| LLaMA2-7B | DoRA | 0.84 | 79.7 (+2.1) |
| LLaMA3-8B | LoRA | 0.70 | 80.8 |
| LLaMA3-8B | DoRA | 0.71 | 85.2 (+4.4) |
What to notice: The improvement is not uniform across model sizes — it’s +3.7 on 7B, +1.0 on 13B, then bigger again (+2.1 and +4.4) on the newer architectures. This suggests the benefit of DoRA is not purely a function of model size but also of the “distance” between LoRA and FT’s optimal learning pattern, which may vary by architecture.
DoRA† (half the rank of LoRA) consistently beats LoRA with half the parameters: +2.8/+1.0/+2.9/+4.2 points on LLaMA-7B/LLaMA-13B/LLaMA2-7B/LLaMA3-8B respectively. This is arguably more practically important than equal-rank comparisons: DoRA can exceed LoRA’s accuracy with roughly half the trainable parameters.
3.2 Rank robustness analysis
Setup: Fix LLaMA-7B, vary rank for both LoRA and DoRA. Evaluate on commonsense reasoning.
Figure 5: Accuracy vs Rank (LLaMA-7B, Commonsense Reasoning)
| Rank | LoRA Avg. | DoRA Avg. | Delta |
|---|---|---|---|
| r=4 | 39.5 | 61.9 | +22.4 |
| r=8 | 40.7 | 77.9 | +37.2 |
| r=16 | 70.9 | 77.5 | +6.6 |
| r=32 | 74.7 | 78.4 | +3.7 |
| r=64 | 65.8 | 72.1 | +6.3 |
The most striking finding is the catastrophic failure of LoRA at r=4 and r=8 (39.5% and 40.7% — near random). DoRA maintains 61.9% at r=4 and 77.9% at r=8. This is a 37-point gap at r=8.
The explanation connects to §2.4: at very low ranks, LoRA’s coupled magnitude-direction updates waste gradient capacity, since the limited rank budget must simultaneously correct both magnitude and direction. DoRA separates them, so even with a tiny $r$ (small directional budget), the trainable $\mathbf{m}$ handles the magnitude correction, and the LoRA matrices focus entirely on directional updates.
3.3 Visual instruction tuning (LLaVA-1.5-7B)
Setup: Fine-tune LLaVA-1.5-7B (Vicuna-1.5-7B language model + CLIP ViT-L/336px vision encoder) on standard visual instruction tuning data. Evaluate on seven VL benchmarks: VQAv2, GQA, VizWiz, SQA, VQAT, POPE, MMBench.
| Method | #Params (%) | VQAv2 | GQA | VizWiz | SQA | VQAT | POPE | MMBench | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| FT (100%) | 100 | 78.5 | 61.9 | 50.0 | 66.8 | 58.2 | 85.9 | 64.3 | 66.5 |
| LoRA | 4.61 | 79.1 | 62.9 | 47.8 | 68.4 | 58.2 | 86.4 | 66.1 | 66.9 |
| DoRA | 4.63 | 78.6 | 62.9 | 52.2 | 69.9 | 57.0 | 87.2 | 66.1 | 67.6 |
Note: On VizWiz (visual QA for blind users, more challenging), DoRA improves by +4.4 points over LoRA. On SQA (science QA), +1.5 points. DoRA’s overall average of 67.6 beats both LoRA (66.9) and FT (66.5). The fact that DoRA beats FT here likely indicates that the training data setup causes FT to overfit, while DoRA’s constrained optimization (low-rank direction + scalar magnitude) provides implicit regularization.
3.4 Image/video-text understanding (VL-BART)
Setup: Fine-tune VL-BART (CLIP-ResNet101 + BARTBase) on four image-text tasks (VQAv2, GQA, NLVR2, MSCOCO) and four video-text tasks (TVQA, How2QA, TVC, YC2C).
| Task | FT | LoRA | DoRA |
|---|---|---|---|
| Image avg. | 77.3 | 76.5 | 77.4 (+0.9) |
| Video avg. | 87.5 | 83.5 | 85.4 (+1.9) |
Deltas are relative to LoRA. DoRA matches FT on image-text tasks (77.4 vs 77.3) while using only about 6% of the parameters. The +1.9 point gain over LoRA on video-text is especially notable since video-text requires stronger temporal reasoning, suggesting DoRA’s fine-grained update control benefits tasks with higher adaptation complexity, although DoRA still trails FT there.
3.5 Training sample robustness
Setup: Fine-tune LLaMA2-7B and LLaMA-7B on instruction-tuning subsets of Alpaca (1000, 4000, 7000, 10000 samples). Evaluate on MT-Bench.
Figure 6: MT-Bench vs Training Set Size (LLaMA2-7B)
| #Samples | LoRA | DoRA | VeRA | DVoRA |
|---|---|---|---|---|
| 1,000 | 5.41 | 5.70 | 5.21 | 5.43 |
| 4,000 | 5.55 | 5.82 | 5.38 | 5.60 |
| 7,000 | 5.68 | 5.98 | 5.40 | 5.71 |
| 10,000 | 5.70 | 6.00 | 5.50 | 6.00 |
DoRA’s advantage over LoRA is essentially constant across data sizes: the gap is +0.29 at 1,000 samples, +0.27 at 4,000, and +0.30 at both 7,000 and 10,000. This rules out the explanation that “DoRA just gets more from more data”; the benefit is stable. DVoRA at 0.04% parameters achieves the same score as DoRA at 2.33% parameters (6.00 on LLaMA2-7B with 10,000 samples), a remarkable efficiency-accuracy tradeoff.
3.6 Tuning granularity: selective magnitude updates
DoRA’s analysis reveals that when directional updates dominate, magnitude changes are small. Exploiting this, the authors test a reduced granularity variant: apply full DoRA (direction + magnitude) to Q, K, V attention projections, but apply only magnitude updates to gate/up/down (MLP) projections.
| Method | #Params (%) | LLaMA-7B Avg. | LLaMA-13B Avg. |
|---|---|---|---|
| LoRA | 0.83 | 74.7 | 80.5 |
| DoRA (full) | 0.84 | 78.1 | 81.5 |
| DoRA (reduced) | 0.39 | 77.5 | 81.3 |
The reduced granularity variant uses 0.39% parameters (less than half of LoRA’s 0.83%) and still beats LoRA by +2.8/+0.8 points. This confirms the earlier observation that MLP weights primarily need magnitude correction, not directional rotation — splitting the budget accordingly is meaningful.
4. Design choices, alternatives, and boundary conditions
4.1 Why column-wise norms, not row-wise or global?
The choice to normalize column-wise ($\|\cdot\|_c$) matches the linear algebra of the forward pass. For $W \in \mathbb{R}^{d \times k}$ applied as $y = Wx$ where $x \in \mathbb{R}^{k}$, each output dimension is $y_i = W_{i,:}\, x$ (a row inner product). But each column $W_{:,j}$ corresponds to the weight vector associated with input dimension $j$. The column-wise normalization ensures each such weight vector is a unit direction, with the scalar magnitude $m_j$ absorbing the scale.
Alternative: Row-wise normalization (normalize each row of $W$ and learn a magnitude per row). This would decompose the “output neuron” rather than the “input feature weight.” The decomposition follows the same magnitude-times-unit-direction form as the weight normalization paper (Salimans & Kingma), which applies it per output neuron in fully-connected layers.
Boundary: In practice, $W_0$ may have columns with near-zero norms (dead neurons). Adding a small $\epsilon$ to the column norm before dividing is a common safeguard against division by zero.
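One way to guard the normalization against near-zero column norms is to clamp the norm from below. This is a sketch of a common numerical safeguard, not something the paper specifies; the `eps` value and the function name are assumptions:

```python
import numpy as np


def safe_column_normalize(V, eps=1e-6):
    """Divide each column by its L2 norm, clamped below by eps (assumed value)."""
    norms = np.linalg.norm(V, axis=0, keepdims=True)
    return V / np.maximum(norms, eps), norms


V = np.array([[3.0, 0.0],
              [4.0, 0.0]])      # second column is all zeros ("dead")
directions, norms = safe_column_normalize(V)

print(directions[:, 0])              # [0.6 0.8]
print(np.isfinite(directions).all())  # True, no NaN from the zero column
```

A dead column stays at zero instead of becoming NaN, so its magnitude scalar simply has no effect until the directional update moves the column away from zero.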
4.2 What happens if we train $V$ directly instead of via LoRA?
If $V$ (the direction matrix) were trained directly (without a low-rank constraint), DoRA would be equivalent to full FT in expressiveness: just a reparameterization with more trainable parameters (both $\mathbf{m}$ and $V$ are trained). The LoRA constraint on $\Delta V$ is what makes DoRA parameter-efficient.
This also means DoRA does not improve full FT — it’s not a method for full FT training. It specifically improves over LoRA in the PEFT setting because the limited rank budget is used more efficiently when magnitude and direction are decoupled.
4.3 Does DoRA add inference overhead?
No. The magnitude $\mathbf{m}$ and the LoRA matrices $B$, $A$ can all be merged into a single dense weight before deployment:
W'_{\text{merged}} = \mathbf{m} \cdot \frac{W_0 + BA}{\|W_0 + BA\|_c}
This is a one-time computation. The deployed model has identical architecture and inference cost to the original pretrained model. This is DoRA’s key advantage over adapter-based methods, which add latency via sequential/parallel insertion.
4.4 Why is DoRA better at low ranks than LoRA?
At very low ranks (e.g., $r = 4$), LoRA must spend its limited expressiveness on both directional updates and magnitude corrections simultaneously. Since $BA$ couples these, small rank means neither can be done well.
DoRA makes magnitude correction nearly free (it’s just $k$ scalars, one per column, with no rank restriction). The rank budget is dedicated to directional updates. At $r = 4$, DoRA can still rescale every column independently, while LoRA must compress both effects together into a single rank-4 update.
Boundary condition: As $r \to \min(d, k)$ (full rank), LoRA approaches FT and DoRA’s advantage shrinks. The improvement is most pronounced at small $r$ and large models (where the gap between low-rank expressiveness and FT is largest).
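The claim that magnitude correction is expensive for a purely additive low-rank update can be illustrated numerically. In this sketch (random toy matrix, hand-picked scaling range, all names illustrative), the target keeps every direction of $W_0$ and only rescales the columns. The additive update needed is generally full rank, so even the best rank-4 approximation leaves a residual, while setting the magnitude vector directly reproduces the target:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 12, 8, 4
W0 = rng.normal(size=(d, k))
scale = rng.uniform(1.2, 1.5, size=(1, k))  # per-column rescaling, all != 1

# Target: same directions as W0, but every column magnitude changed.
W_target = W0 * scale
delta = W_target - W0                       # additive update a LoRA-style BA must represent
delta_rank = np.linalg.matrix_rank(delta)

# Best rank-r additive approximation (truncated SVD) still leaves a residual.
U, S, Vt = np.linalg.svd(delta, full_matrices=False)
delta_r = (U[:, :r] * S[:r]) @ Vt[:r]
lora_residual = np.linalg.norm(delta - delta_r)

# DoRA-style: keep the direction W0 / ||W0||_c and set m directly.
col_norm = np.linalg.norm(W0, axis=0, keepdims=True)
m = col_norm * scale
W_dora = m * W0 / col_norm
dora_residual = np.linalg.norm(W_target - W_dora)

print(delta_rank, lora_residual > dora_residual)  # 8 True
```

This isolates only the magnitude side of the argument; real fine-tuning targets mix magnitude and direction changes, so the practical gap depends on the task.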
4.5 Relationship to previous weight decomposition work
Weight normalization (Salimans & Kingma, 2016): Same mathematical decomposition, but (1) applied during training from scratch, (2) both $g$ and $\mathbf{v}$ are randomly initialized (sensitive), (3) motivated by faster convergence via gradient covariance conditioning. DoRA uses the same decomposition for fine-tuning, initialized from pretrained weights (no sensitivity issue), and motivates it via learning pattern analysis rather than convergence speed.
SVD-based compression (SVD-LLM, ASVD): These methods approximate $W$ with a low-rank matrix by truncating singular values, for post-training compression. They do not train the model further. DoRA is a training method, not a compression method: the weight is not low-rank at deployment (it’s merged to a full dense matrix).
AdaLoRA: Adaptively allocates rank budget across layers by parameterizing $\Delta W$ in SVD form and pruning small singular values. It’s still a LoRA variant: all gradient energy goes into updating $\Delta W$, with no explicit magnitude/direction separation. DoRA’s improvement comes from a fundamentally different mechanism.
4.6 QDoRA: combining with quantized backbones
QLoRA (Dettmers et al., NeurIPS 2023) quantizes the frozen backbone to 4-bit NF4 and applies LoRA adapters in full precision. QDoRA substitutes the LoRA component with DoRA while keeping the quantized backbone frozen.
On Orca-Math (100k math word problems), QDoRA achieves exact-match 0.27 on LLaMA2-7B versus QLoRA’s 0.08 — a 3.4× improvement. On LLaMA3-8B, QDoRA achieves 0.31 versus QLoRA’s 0.23. Notably, QDoRA slightly outperforms full FT (which requires much more memory) on these benchmarks.
5. Limitations and boundary conditions
5.1 Training memory
The base DoRA (without the detach trick in §2.3) requires computing gradients through the normalization $\|V'\|_c$, which increases the gradient graph depth and memory. The detach trick recovers most of this (−24.4% GPU memory on LLaMA-7B) at negligible accuracy cost. Still, DoRA requires slightly more memory than raw LoRA at equal rank because of the additional $\mathbf{m}$ vector and the dynamic norm computation.
5.2 Hyperparameter sensitivity for learning rate
The magnitude $\mathbf{m}$ and the LoRA matrices $B$, $A$ may prefer different learning rates. In the paper’s experiments, the same learning rate is used for all (with some per-experiment tuning), which works well but may not be optimal. Separate learning rate schedules for $\mathbf{m}$ vs $\{B, A\}$ could potentially improve results further.
5.3 Task coverage
The commonsense reasoning benchmark suite used (following Hu et al., 2023) has known limitations: all tasks are multiple-choice, which may not fully represent instruction-following, generation quality, or reasoning-intensive tasks. The MT-Bench evaluations (GPT-4 scored) provide a more nuanced signal and confirm the trend.
5.4 Comparison with newer LoRA variants
The paper was submitted in Feb 2024 (ICML 2024 accepted). More recent variants (PiSSA, LoRA+, FLORA) have since emerged. Whether DoRA remains the Pareto-optimal PEFT method in 2025–2026 requires updated comparison — though the underlying gradient decoupling insight is structural and unlikely to be superseded by minor tweaks to LoRA.
5.5 Does DoRA help with alignment fine-tuning?
The paper demonstrates DoRA on SFT (supervised fine-tuning) and instruction-following, but not on RLHF or DPO. Since DoRA modifies the learning dynamics of the gradient update, it’s plausible (but unproven) that it also improves reward modeling or preference learning. This is an open direction.
6. Reproducibility
6.1 Code availability
Official PyTorch implementation: https://github.com/NVlabs/DoRA
DoRA is integrated into Hugging Face PEFT (supported by the HF PEFT team, acknowledged in the paper). Standard usage:
from peft import LoraConfig, get_peft_model

# `model` is assumed to be an already-loaded Hugging Face causal LM
config = LoraConfig(
    r=16,
    lora_alpha=32,
    use_dora=True,  # ← DoRA flag
    target_modules=["q_proj", "v_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
6.2 Replicating the commonsense reasoning benchmark
Data: Hu et al. (2023) training protocol — combine training data from 8 commonsense tasks.
LLaMA-7B hyperparameters (DoRA):
Rank r: 32 (full), 16 (DoRA†), consistent with the r=32 and r=16 rows of the rank table in §3.2
Alpha: 32
Dropout: 0.05
Optimizer: AdamW
LR: 2e-4
Scheduler: Linear decay
Batch size: 16
Warmup steps: 100
Epochs: 3
Target modules: Q, K, V, Up, Down
Expected throughput: On an A100 80GB, DoRA on LLaMA-7B trains at approximately the same speed as LoRA at the same rank (comparable FLOP count in the forward pass; slight overhead from the norm computation).
6.3 Key ablation: verifying the detach trick
To verify the memory saving from §2.3 without accuracy loss:
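import torch
d, k = 64, 32  # illustrative sizes (assumption)
V = torch.randn(d, k)  # frozen direction base (= W0)
m = V.norm(dim=0, keepdim=True).clone().requires_grad_()  # trainable magnitude
delta_V = torch.zeros(d, k, requires_grad=True)  # stands in for B @ A
# Standard DoRA (high memory): gradients flow through the norm
W_prime = m * (V + delta_V) / (V + delta_V).norm(dim=0, keepdim=True)
# Efficient DoRA (detach norms from grad graph):
norms = (V + delta_V).norm(dim=0, keepdim=True).detach()
W_prime = m * (V + delta_V) / norms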
Reported: 24.4% less GPU memory, 0.2 accuracy point difference on LLaMA-7B commonsense.
6.4 The weight decomposition diagnostic tool
The analysis tool from Section 3.2 of the paper can be implemented as:
import torch

def weight_decomp_analysis(W0, W_finetuned):
    """
    Returns (delta_M, delta_D), each averaged over the k columns.
    W0, W_finetuned: (d, k) tensors
    """
    m0 = W0.norm(dim=0)  # (k,) column magnitudes of the pretrained weight
    m_ft = W_finetuned.norm(dim=0)  # (k,) column magnitudes after fine-tuning
    v0 = W0 / m0.unsqueeze(0)  # unit columns
    v_ft = W_finetuned / m_ft.unsqueeze(0)
    delta_M = (m_ft - m0).abs().mean().item()
    # cosine similarity per column (columns are already unit length)
    cos_sim = (v0 * v_ft).sum(dim=0)  # (k,)
    delta_D = (1 - cos_sim).mean().item()
    return delta_M, delta_D
This lets practitioners diagnose whether their LoRA vs FT gap is driven by magnitude issues, direction issues, or both — informing whether DoRA (or simpler magnitude-only tuning) is appropriate.
7. Summary and broader perspective
DoRA is a small but principled improvement to LoRA that is grounded in a concrete empirical observation. The key contributions are:
- Diagnostic method (weight decomposition analysis): A simple, general tool to compare the learning patterns of any fine-tuning method against FT. This is independently valuable beyond DoRA.
- DoRA: Decompose weights into magnitude and direction; train magnitude directly and direction via LoRA. The decomposition mechanistically explains the FT-LoRA accuracy gap and closes it by construction.
- Empirical breadth: Consistent improvements across LLaMA/LLaMA2/LLaMA3, LLaVA, VL-BART, NLP and vision-language tasks, instruction tuning and commonsense reasoning, with and without quantization.
The broader lesson is about diagnostic-driven design: rather than proposing a new architecture or loss function and hoping it improves accuracy, the authors first characterized the structural difference between what they had (LoRA) and what they wanted (FT behavior), then designed the minimal change to close that gap. This methodology tends to produce methods that generalize well precisely because they fix a root cause rather than add complexity.
For practitioners working with constrained GPU budgets, DoRA’s most actionable results are:
- At low rank ($r \le 8$): DoRA’s improvement over LoRA is massive (+22 to +37 points) and DoRA should be strongly preferred.
- At standard ranks: DoRA gives consistent improvements of roughly +1 to +4 points at equal rank (§3.1) with negligible overhead.
- DoRA† (half rank): If training budget is tight, halving the rank and using DoRA consistently outperforms standard LoRA.
- The HuggingFace PEFT integration (use_dora=True) makes adoption a one-line change.
Appendix A: Extended Derivations
A.1 Full gradient derivation without the detach approximation
Without the detach optimization, the normalization is part of the computation graph. Using the chain rule through the column-norm operation:
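For column $n$ of $V' = W_0 + BA$, let $v'_n = V'_{:,n}$. The column norm is $C_n = \|v'_n\|_2$. The normalized direction is $v'_n / C_n$, so the merged column is $W'_{:,n} = m_n v'_n / C_n$.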
The loss gradient w.r.t. $v'_n$:
$$\frac{\partial \mathcal{L}}{\partial v'_n} = \frac{m_n}{C_n} \left(I - \frac{v'_n v'^{\top}_n}{C_n^2}\right) \frac{\partial \mathcal{L}}{\partial W'_{:,n}} \tag{A.1}$$
This is a scaled projection of the weight gradient onto the space orthogonal to $v'_n$. Equation (A.1) says that the gradient of $V'$ (and thus of $\Delta V = BA$) is the component of the weight gradient that is perpendicular to the current direction. The component parallel to $v'_n$ is absorbed by the magnitude gradient (Eq. 8). This is a cleaner, more formal statement of why DoRA decouples direction from magnitude.
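The detach approximation drops the $v'_n v'^{\top}_n / C_n^2$ term, replacing Eq. (A.1) with simply $\frac{m_n}{C_n} \frac{\partial \mathcal{L}}{\partial W'_{:,n}}$ (Eq. 7). Writing $g_n = \partial \mathcal{L} / \partial W'_{:,n}$, the dropped term has magnitude proportional to $|v'^{\top}_n g_n| / C_n$. Since $\|v'_n\|_2 = C_n$, this is $\|g_n\|_2 \, |\cos\theta_n|$, where $\theta_n$ is the angle between $g_n$ and $v'_n$. It is small when the gradient is nearly perpendicular to $v'_n$ (i.e., when directional updates are needed) and comparable to the kept term when the gradient is aligned with $v'_n$ (i.e., when magnitude updates are needed; in that case the directional update is small anyway by the routing argument). So the dropped term matters little in both relevant regimes, which is consistent with the approximation losing only 0.2 accuracy points.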
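As a numerical sanity check on Eq. (A.1), the sketch below builds one random column, compares the projected gradient with a finite-difference estimate, and measures the term that the detach variant drops. It is an illustration under assumptions rather than the paper's or PEFT's code: it uses NumPy, random toy data, a linear surrogate loss, and hypothetical helper names (`merged_column`, `full_direction_grad`, `detached_direction_grad`).

```python
import numpy as np

def merged_column(v, m_n):
    """Merged column W'_{:,n} = m_n * v'_n / ||v'_n||_2."""
    return m_n * v / np.linalg.norm(v)

def full_direction_grad(v, m_n, g):
    """Gradient w.r.t. v'_n with the column norm kept in the graph (Eq. A.1)."""
    c = np.linalg.norm(v)                           # C_n
    proj = np.eye(v.size) - np.outer(v, v) / c**2   # projector orthogonal to v'_n
    return (m_n / c) * (proj @ g)

def detached_direction_grad(v, m_n, g):
    """Gradient w.r.t. v'_n with C_n treated as a constant (detach variant)."""
    return (m_n / np.linalg.norm(v)) * g

rng = np.random.default_rng(0)
v = rng.normal(size=8)   # hypothetical column v'_n
g = rng.normal(size=8)   # hypothetical upstream gradient dL/dW'_{:,n}
m_n = 1.5                # hypothetical magnitude entry

g_full = full_direction_grad(v, m_n, g)
g_det = detached_direction_grad(v, m_n, g)

# Central finite differences on the surrogate loss L = g . W'_{:,n}.
eps = 1e-6
num_grad = np.array([
    (g @ merged_column(v + eps * e, m_n) - g @ merged_column(v - eps * e, m_n)) / (2 * eps)
    for e in np.eye(v.size)
])

# The full gradient has no component along v'_n.
parallel_component = float(g_full @ v)

# The dropped term has norm (m_n / C_n) * ||g|| * |cos(theta)|.
c = np.linalg.norm(v)
cos_theta = float(g @ v) / (c * np.linalg.norm(g))
dropped_norm = float(np.linalg.norm(g_det - g_full))
expected_dropped = (m_n / c) * abs(cos_theta) * float(np.linalg.norm(g))
```

On this toy column, `num_grad` agrees with `g_full`, `parallel_component` is zero up to rounding, and `dropped_norm` matches the cosine expression, so the size of the approximation error depends only on how well the gradient aligns with the current direction.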
A.2 Proof sketch: DoRA preserves LoRA’s inference merge property
Claim: After training, DoRA weights can be merged into a dense matrix with no additional inference computation.
Proof: Let $m^*$, $B^*$, $A^*$ be the final trained values. Define:
$$W_{\text{merged}} = m^* \, \frac{W_0 + B^* A^*}{\|W_0 + B^* A^*\|_c}$$
This is a dense matrix in $\mathbb{R}^{d \times k}$. At inference, for any input $x$:
$$h = W_{\text{merged}} \, x$$
No additional computation is needed. The magnitude vector and LoRA matrices can be discarded. Memory footprint at inference = $d \times k$ floats, identical to the original model.
The computation cost of computing $W_{\text{merged}}$ is $O(dkr)$ (one matrix multiply for $B^* A^*$, one column-norm computation, one elementwise multiply by $m^*$). This is done once and amortized across all inference calls.
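The merge step can be sketched in a few lines. This is a minimal NumPy illustration, not the PEFT implementation: it assumes the convention $h = Wx$ with $W \in \mathbb{R}^{d \times k}$, a magnitude vector of shape $(1, k)$, column-wise norms, and toy random values. The function name `dora_merge` is hypothetical.

```python
import numpy as np

def dora_merge(W0, B, A, m):
    """Fold the magnitude m and the low-rank update BA into one dense d x k matrix."""
    V = W0 + B @ A                                        # un-normalized direction, O(d*k*r)
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)   # column norms, shape (1, k)
    return m * V / col_norm                               # rescale each column to magnitude m_n

rng = np.random.default_rng(0)
d, k, r = 6, 5, 2                        # toy sizes, chosen only for illustration
W0 = rng.normal(size=(d, k))
B = rng.normal(size=(d, r))
A = rng.normal(size=(r, k))
m = rng.uniform(0.5, 2.0, size=(1, k))   # positive magnitudes, one per column

W_merged = dora_merge(W0, B, A, m)

# Unmerged forward pass: normalize the columns, then scale the input by m.
x = rng.normal(size=k)
V = W0 + B @ A
h_unmerged = (V / np.linalg.norm(V, axis=0)) @ (m.ravel() * x)

# Merged forward pass: a single dense matrix-vector product.
h_merged = W_merged @ x
```

The merged matrix has the same shape as `W0`, its column norms equal the entries of `m`, and the two forward passes agree, which is the sense in which the adapter adds no inference cost once merged.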
A.3 Parameter count comparison across PEFT methods
For a single linear layer $W \in \mathbb{R}^{d \times k}$:
| Method | Trainable Params | Notes |
|---|---|---|
| Full FT | $dk$ | All params |
| LoRA (rank $r$) | $r(d + k)$ | Both $A$ and $B$ |
| DoRA (rank $r$) | $r(d + k) + k$ | $A$, $B$, plus magnitude $m$ |
| AdaLoRA | | Same as LoRA, but $r$ varies per layer |
| VeRA | $d + r$ | Only layer-specific scaling vectors |
| DVoRA | $d + r + k$ | VeRA vectors + DoRA magnitude |
| Prefix (length $l$) | | For each transformer layer |
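For LLaMA-7B with $d = k = 4096$ and $r = 16$:
- LoRA: $r(d + k)$ = 131,072 per layer
- DoRA: $r(d + k) + k$ = 135,168 per layer (+3.1%)
- Full FT: $dk$ = 16,777,216 per layer (128× more than LoRA, about 124× more than DoRA)
The 3.1% parameter overhead of DoRA over LoRA (the $k = 4096$ magnitude scalars) is negligible in practice.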
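The per-layer arithmetic is easy to reproduce. The sketch below assumes a square layer with $d = k = 4096$ and rank $r = 16$, values chosen here because they reproduce the 3.1% overhead quoted above. The function names are hypothetical.

```python
def lora_params(d, k, r):
    """Trainable parameters of LoRA on a d x k layer: B is d x r, A is r x k."""
    return r * (d + k)

def dora_params(d, k, r):
    """LoRA parameters plus one magnitude scalar per column."""
    return lora_params(d, k, r) + k

def full_ft_params(d, k):
    """Every entry of the d x k weight matrix is trainable."""
    return d * k

d = k = 4096   # assumed hidden size of a square projection layer
r = 16         # assumed rank

lora = lora_params(d, k, r)                  # 131072
dora = dora_params(d, k, r)                  # 135168
full = full_ft_params(d, k)                  # 16777216

overhead_pct = 100 * (dora - lora) / lora    # 3.125
full_vs_lora = full / lora                   # 128.0
full_vs_dora = full / dora                   # roughly 124.1
```

Because the magnitude vector adds only $k$ scalars, the relative overhead is $k / (r(d + k))$, which shrinks further as the rank grows.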
Appendix B: Implementation Details
B.1 HuggingFace PEFT implementation notes
The HuggingFace PEFT library implements DoRA as an extension of LoRA. Key implementation choices:
Column norm computation: PEFT recomputes the column norms on every forward pass and detaches them from the autograd graph, matching the paper’s “detach” variant. The norm therefore acts as a constant during backpropagation rather than as a trainable parameter.
Magnitude initialization: When `use_dora=True`, PEFT initializes `lora_magnitude_vector` (the $m$ vector) from the column norms of the pretrained weight. This is equivalent to the paper’s initialization $m = \|W_0\|_c$.
Merge/unmerge: The PEFT library supports `model.merge_adapter()` and `model.unmerge_adapter()` for DoRA, correctly handling the normalization step in the merge computation.
B.2 Adapting DoRA for quantized models (QDoRA)
When using DoRA with bitsandbytes 4-bit quantization:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization; matmuls run in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load the quantized base model (gated checkpoint; requires Hugging Face access).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training before attaching adapters.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    use_dora=True,  # QDoRA
    target_modules=["q_proj", "v_proj", "k_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```
The dequantization step (NF4 → bfloat16) happens automatically during the column norm computation and the DoRA forward pass. Memory usage: typically 8–12 GB for LLaMA3-8B on a single GPU.
B.3 Diagnosing whether DoRA will help: the quick check
The question “should I use DoRA instead of LoRA for my task?” can be answered with the weight decomposition analysis. A simple heuristic:
- Train LoRA for a few hundred steps.
- Compute the Pearson correlation between $\Delta M$ (magnitude change) and $\Delta D$ (directional change) across layers and checkpoints.
- If correlation > +0.3 (LoRA-like positive coupling), DoRA is likely to help.
- If correlation is already negative (FT-like), DoRA may give marginal improvement.
In practice, the correlation is usually strongly positive for LoRA (the paper found +0.83), so DoRA almost always helps when LoRA is used as the baseline.
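The quick check above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: $\Delta M$ is taken as the mean absolute change in column norm and $\Delta D$ as the mean of one minus the column-wise cosine similarity, following the weight decomposition analysis. The "checkpoints" are synthetic perturbations of a random matrix, not real LoRA weights, and the names `delta_m_d` and `coupling` are hypothetical.

```python
import numpy as np

def delta_m_d(W0, Wt):
    """Mean magnitude change and mean directional change, column by column."""
    m0 = np.linalg.norm(W0, axis=0)
    mt = np.linalg.norm(Wt, axis=0)
    delta_m = np.mean(np.abs(mt - m0))
    cos = np.sum(W0 * Wt, axis=0) / (m0 * mt)   # cosine between matching columns
    delta_d = np.mean(1.0 - cos)
    return float(delta_m), float(delta_d)

def coupling(pairs):
    """Pearson correlation between delta_m and delta_d over (W0, Wt) pairs."""
    dm, dd = zip(*(delta_m_d(W0, Wt) for W0, Wt in pairs))
    return float(np.corrcoef(dm, dd)[0, 1])

rng = np.random.default_rng(0)
W0 = rng.normal(size=(128, 32))      # stand-in for a pretrained weight
noise = rng.normal(size=(128, 32))   # stand-in for an accumulated update

# Synthetic "checkpoints": the same update applied at increasing scale.
scales = [0.1, 0.2, 0.3, 0.4, 0.5]
pairs = [(W0, W0 + s * noise) for s in scales]

corr = coupling(pairs)

# Reference cases: no change at all, and a pure magnitude change.
same_m, same_d = delta_m_d(W0, W0)
scaled_m, scaled_d = delta_m_d(W0, 2.0 * W0)
```

With real adapters, each `Wt` would be the merged weight of a target module at a given checkpoint, and the pairs would span all adapted layers. A pure rescaling gives a nonzero magnitude change with no directional change, which is the kind of decoupled update the heuristic is looking for.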
Appendix C: Relationship to Other Low-Rank Methods
C.1 Spectral perspective on LoRA vs DoRA
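The weight matrix $W_0$ has a singular value decomposition $W_0 = U \Sigma Q^{\top}$. Each column of $W_0$ can be expressed in terms of the singular vectors: $W_{0,:,n} = \sum_i \sigma_i Q_{n,i} \, u_i$, a combination of the left singular vectors $u_i$ weighted by the singular values and the $n$-th row of $Q$.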
LoRA’s update adds a rank-$r$ perturbation in the full space. There’s no explicit constraint on which singular directions are updated.
DoRA’s direction update also operates in the full space, but the normalization ensures that the column directions of $W_0 + BA$ are mapped to unit vectors before magnitude scaling. This implicitly prevents any single direction from dominating by keeping each column norm equal to the corresponding entry of $m$.
C.2 Why DoRA might outperform LoRA more on newer architectures
LLaMA-3-8B shows a larger DoRA improvement (+4.4 points) than LLaMA-7B (+3.7) despite similar parameter counts. Several factors may contribute:
Grouped-Query Attention (GQA): LLaMA-3 uses GQA, which means key and value projections have fewer heads than query projections. The matrices have different aspect ratios, and their “natural” low-rank direction in the task-specific fine-tuning objective may diverge more from LoRA’s isotropic update space.
Rotary Position Embeddings (RoPE): The RoPE variant in LLaMA-3 (with different base frequency) may result in weight matrices where the task-relevant fine-tuning directions are more strongly separated from the pretrained directions, making the magnitude/direction decoupling more valuable.
Embedding layer scale: LLaMA-3 uses a larger vocabulary (128K tokens vs 32K in LLaMA), affecting the embedding weight matrices where DoRA’s column-wise normalization has the strongest effect.
These are hypotheses — the paper does not provide ablations on these architectural differences. An interesting future experiment would be to apply the weight decomposition analysis separately to each module type (q/k/v/o/gate/up/down) for LLaMA-3 to identify where the largest FT-vs-LoRA learning pattern difference occurs.
Appendix D: Extended Experimental Context
D.1 The commonsense reasoning benchmark suite
The eight tasks used in the commonsense reasoning evaluation are:
| Task | Type | Size (test) | Description |
|---|---|---|---|
| BoolQ | Binary QA | 3,270 | Reading comprehension, yes/no |
| PIQA | MC (2-choice) | 1,838 | Physical intuition QA |
| SIQA | MC (3-choice) | 1,954 | Social interaction QA |
| HellaSwag | MC (4-choice) | 10,003 | Sentence completion, activity |
| WinoGrande | Coreference | 1,267 | Winograd-style pronoun resolution |
| ARC-e (Easy) | MC (4-choice) | 2,376 | Science exam questions, easy |
| ARC-c (Challenge) | MC (4-choice) | 1,172 | Science exam questions, hard |
| OBQA (OpenBookQA) | MC (4-choice) | 500 | Open-book science questions |
These tasks vary widely in their linguistic demands. BoolQ requires careful passage reading; HellaSwag requires world knowledge about typical activity progressions; WinoGrande requires pronoun coreference with commonsense grounding. The combined training set contains 170,000+ examples across all tasks.
Following the LLM-Adapters protocol, all 8 training sets are combined for training, and evaluation is done on each task’s test set separately. The reported metric is accuracy (binary or multi-class), averaged across all 8 tasks for the summary number.
D.2 MT-Bench details and what the scores mean
MT-Bench evaluates 80 multi-turn conversations across 8 categories. The GPT-4 judge assigns each answer a score from 1 to 10. Interpreting the scores:
- < 4.0: Poor instruction following, frequent off-topic or incoherent answers
- 4.0 – 5.5: Below average; can follow simple instructions but struggles with multi-step reasoning
- 5.5 – 6.5: Average; capable model with some reasoning ability
- 6.5 – 7.5: Good; handles most MT-Bench categories well
- > 7.5: Excellent; comparable to commercial APIs
The LoRA baseline scores of 5.1–5.7 and DoRA’s 5.5–6.0 are in the “below average to average” range, consistent with these being relatively small 7B models fine-tuned on limited instruction data. The improvement from DoRA (+0.3–0.4) is meaningful given the benchmark’s resolution.
D.3 Variance and statistical significance
The paper reports single-run results for most experiments. The commonsense reasoning results are relatively stable (low-variance tasks with large test sets). MT-Bench results have more variance due to GPT-4 judge noise (estimated ±0.2 per run). The paper’s 0.3 DoRA improvement on MT-Bench should be interpreted with this in mind — it’s a consistent trend, not a precisely measured delta.
For practical deployment decisions, the rank robustness results (the 37-point gap at r=8) are the most statistically decisive finding, as the difference is far larger than any plausible variance.
D.4 Comparison to concurrent work
Around the time of DoRA’s publication (Feb 2024 on arXiv), the primary closely related PEFT works were:
- PiSSA (arXiv 2404.02948): Uses the SVD of $W_0$ and initializes LoRA’s $A$ and $B$ from the principal singular components rather than Kaiming/zero. PiSSA and DoRA target different root causes: PiSSA improves initialization, DoRA improves the structural coupling.
- MoRA (arXiv 2405.12130): Replaces the two rectangular matrices with a single square matrix to allow higher-rank updates with the same parameter count. This is orthogonal to DoRA’s magnitude/direction decomposition.
- LoRA+ (arXiv 2402.12354): Addresses the learning rate imbalance between $A$ and $B$ in LoRA. DoRA addresses a different problem (magnitude/direction coupling), and the two fixes could be combined.
None of these address the same root cause as DoRA, suggesting they could be combined. A DoRA variant with PiSSA-style initialization, LoRA+ learning rate scheduling, and DVoRA’s parameter efficiency has not been systematically studied but would be a natural next step.