Review date: 2026-05-23
Review author: Zhongzhu Zhou
Paper reviewed: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Paper authors: DeepSeek-AI (Core: Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu et al.)
arXiv: 2501.12948
Status/Venue: arXiv preprint (January 2025 / January 2026 v2), fully open-sourced at HuggingFace
Short Answer
DeepSeek-R1 demonstrates that a language model can develop sophisticated multi-step reasoning — including self-reflection, verification, and exploration of alternative approaches — purely through reinforcement learning against outcome-based rewards, with no human-annotated reasoning trajectories. The model matches or exceeds OpenAI-o1 on a wide range of benchmarks. This matters not just for the result but for the mechanism: it shows that RL is a genuine path to capability, not just alignment.
Prerequisites: What You Need to Know First
Before diving in, let me lay out the background concepts you’ll need to follow the technical argument.
1. Language Model Post-Training
A raw pre-trained LLM predicts next tokens; it is not yet a useful assistant. Post-training refers to the suite of techniques applied after pre-training to produce a helpful, harmless model. The classical recipe is:
- Supervised Fine-Tuning (SFT): Train the model on (prompt, ideal response) pairs curated by humans or high-quality models. The model learns to mimic the style and content of the training corpus.
- Reinforcement Learning from Human Feedback (RLHF): A separate reward model is trained to predict human preference between two responses. The LLM policy is then updated by PPO to maximize this reward. This is how InstructGPT, GPT-4, and Claude were aligned.
The key limitation: SFT caps model quality at human-demonstration quality. If the human annotators reason in a certain way, the model imitates that reasoning style — it can’t discover better strategies on its own.
2. Chain-of-Thought (CoT) Prompting
Chain-of-thought prompting (Wei et al., 2022) asks the model to produce intermediate reasoning steps before giving the final answer. This dramatically improves performance on math, logic, and science problems. The key insight is that long outputs can represent computation: the model can “think on paper.”
OpenAI’s o1 (2024) showed that making this explicit at training time (not just inference time) — teaching models to generate long internal monologues — dramatically boosts performance on hard math and coding benchmarks.
3. Proximal Policy Optimization (PPO)
PPO (Schulman et al., 2017) is the standard RL algorithm used for RLHF. Recall the basic RL setup: a policy $\pi_\theta$ selects actions (tokens) in states (partial sequences) to maximize expected reward. PPO’s objective:
$$J_{\text{PPO}}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\ \text{clip}\left(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$
where $\rho_t(\theta) = \pi_\theta(a_t \mid s_t)\,/\,\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability ratio and $\hat{A}_t$ is the advantage: how much better this action is relative to a baseline.
The advantage requires a value model $V_\phi(s)$ that predicts expected future reward from state $s$. This value model is usually as large as the policy model, doubling memory consumption.
Generalized Advantage Estimation (GAE, Schulman et al., 2015) with discount $\gamma$ and smoothing $\lambda$ is the standard way to reduce variance in this estimate. But $\lambda$ is notoriously sensitive to tune.
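To make the role of the value model and of $\gamma$ and $\lambda$ concrete, here is a minimal sketch of GAE in plain Python. The function name `gae_advantages` and the toy numbers are illustrative assumptions, not taken from the paper; the example assumes a single terminal reward, as in outcome-based RL.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation over one trajectory.

    rewards[t] is the reward received after step t and values[t] is the
    value model's prediction V(s_t). The value after the last step is 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # discounted sum of residuals
        advantages[t] = running
    return advantages


# Toy trajectory: 4 tokens, reward only at the end, value model predicts 0.5 everywhere.
toy_rewards = [0.0, 0.0, 0.0, 1.0]
toy_values = [0.5, 0.5, 0.5, 0.5]
adv_lam1 = gae_advantages(toy_rewards, toy_values, gamma=1.0, lam=1.0)
adv_lam0 = gae_advantages(toy_rewards, toy_values, gamma=1.0, lam=0.0)
```

With `lam=1.0` every token gets the full outcome minus the value prediction, while with `lam=0.0` only the last token sees any signal. Every entry depends on `values`, which is why a poorly trained value model hurts.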
4. KL Divergence Penalty
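In RLHF, the policy $\pi_\theta$ is constrained not to drift too far from a reference policy $\pi_{\text{ref}}$ (usually the SFT model). This prevents “reward hacking” where the model finds degenerate solutions that score well on the reward model but are nonsensical to humans. The KL penalty is:
$$D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) = \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}\left[\log \frac{\pi_\theta(o \mid q)}{\pi_{\text{ref}}(o \mid q)}\right]$$
An unbiased estimator (Schulman 2020) that avoids needing to sample from $\pi_{\text{ref}}$:
$$\hat{D}_{\text{KL}} = \frac{\pi_{\text{ref}}(o \mid q)}{\pi_\theta(o \mid q)} - \log \frac{\pi_{\text{ref}}(o \mid q)}{\pi_\theta(o \mid q)} - 1$$
This estimator is non-negative and equals zero only when $\pi_\theta(o \mid q) = \pi_{\text{ref}}(o \mid q)$, making it suitable as a loss term.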
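As a quick illustration, the estimator can be evaluated from log-probabilities of a single sample. This is a minimal sketch; the function name `kl_k3` and the example probabilities are assumptions made for illustration.

```python
import math


def kl_k3(logp_theta: float, logp_ref: float) -> float:
    """Per-sample KL estimate x - log(x) - 1, where x = pi_ref / pi_theta."""
    log_x = logp_ref - logp_theta
    return math.exp(log_x) - log_x - 1.0


same = kl_k3(math.log(0.3), math.log(0.3))    # policy agrees with the reference
higher = kl_k3(math.log(0.6), math.log(0.3))  # policy more confident than the reference
lower = kl_k3(math.log(0.1), math.log(0.3))   # policy less confident than the reference
```

The estimate is zero when the two probabilities agree and positive in both directions of disagreement, which is what makes it usable as a penalty.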
The Core Contribution
DeepSeek-R1 makes two main claims:
- Emergence without annotation (R1-Zero): A large LM can develop sophisticated multi-step reasoning patterns (self-reflection, verification, exploring alternatives) purely through RL with outcome-based rewards. No SFT, no human-written reasoning traces. The reasoning behaviors emerge from the optimization process itself.
- Practical frontier reasoning (R1): A 4-stage pipeline that combines cold-start SFT, two stages of RL, and rejection-sampling SFT can produce a model competitive with OpenAI o1 while being fully open-source.
Part I: GRPO — Group Relative Policy Optimization
Why a New Algorithm?
PPO works but is expensive for long-CoT training for three reasons:
- The value model doubles GPU memory.
- The value model must predict expected future reward from partial sequences — but for a long reasoning chain where the model might revise earlier steps later, this prediction is extremely noisy.
- PPO’s KL penalty enters as a per-token reward, which implicitly penalizes sequence length — bad for training models that should reason longer.
GRPO addresses these by eliminating the value model entirely and by applying the KL term directly in the loss rather than as a per-token reward.
The GRPO Objective
Figure 1: GRPO vs PPO Architecture
PPO:
q → Policy → o → Reward Model → r
↕
Value Model → v → GAE → Advantage Â
GRPO:
q → Policy → {o₁, o₂, …, oG}
↓
Reward Model → {r₁, r₂, …, rG}
↓
Group statistics: mean(r), std(r)
↓
Aᵢ = (rᵢ - mean) / std
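For each question $q$, GRPO samples a group of $G$ outputs $\{o_1, o_2, \dots, o_G\}$ from the current policy $\pi_{\theta_{\text{old}}}$. It then optimizes:
$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} A_i,\ \text{clip}\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon \right) A_i \right) - \beta\, D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right) \right] \tag{1}$$
The advantage is computed directly from group scores:
$$A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \dots, r_G\})}{\text{std}(\{r_1, r_2, \dots, r_G\})} \tag{2}$$
The KL term uses the unbiased estimator (Eq. 3):
$$D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) = \frac{\pi_{\text{ref}}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - \log \frac{\pi_{\text{ref}}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - 1 \tag{3}$$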
Step-by-Step GRPO Forward Pass
Let me trace through one training step explicitly.
Step 1: Sample group. For question $q$, sample $G = 16$ responses from $\pi_{\theta_{\text{old}}}$ with temperature 1. Each $o_i$ is a complete sequence up to 32,768 tokens (64K after step 8,200).
Step 2: Score. Pass each $o_i$ through the reward function (rule-based for math/code: check final answer correctness + format compliance). Obtain $\{r_1, \dots, r_G\}$.
Step 3: Normalize. Compute $A_i = (r_i - \mu)/\sigma$ where $\mu$ and $\sigma$ are the mean and standard deviation within this group. This is analogous to whitened advantages in PPO but requires no learned value function.
Step 4: Gradient. Compute the GRPO loss (Eq. 1) using the policy ratios. The clip with $\epsilon = 10$ (far larger than PPO’s usual 0.2) limits how much the policy update can move. The large clip is a deliberate choice (more on this below).
Step 5: KL regularization. The KL term (Eq. 3) is added to the loss with coefficient $\beta = 0.001$. The reference policy $\pi_{\text{ref}}$ is re-synced to the current policy every 400 steps.
Step 6: Update. Standard gradient descent on $\theta$, keeping $\pi_{\text{ref}}$ frozen. The 8,192 outputs generated per rollout are split into 16 mini-batches and trained for a single inner epoch.
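The scoring and normalization steps above can be sketched in a few lines of plain Python. This is only an illustration: the function name `group_advantages`, the small stabilizing constant, and the choice of population (rather than sample) standard deviation are assumptions, and the reward list is made up.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: A_i = (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 8 outputs for one question: two correct, six wrong.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_advantages(rewards)
```

Correct outputs receive a positive advantage and incorrect ones a negative advantage, and the advantages sum to roughly zero without any learned value function.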
Why Does Normalizing Within the Group Work?
The group normalization transforms rewards into a zero-mean, unit-variance signal. This is important because:
- Rewards on different problems have wildly different scales (a problem that always gives 0 or 1 vs. a partial-credit rubric).
- Without normalization, the policy would update more aggressively on problems where rewards happen to be large-valued, leading to uneven learning.
- The group baseline (mean) serves the same role as the value function baseline: subtracting it reduces variance without introducing bias.
The key insight: you don’t need to predict the baseline from a separate model. If you sample multiple outputs for the same question, the empirical mean is a very good baseline with zero additional parameters.
The Large Clip Ratio (ε = 10)
Standard PPO uses $\epsilon = 0.2$, meaning the policy can only change the token probability ratio by ±20% per update before the gradient is clipped. DeepSeek-R1 sets $\epsilon = 10$. This looks enormous, and it is.
Why? For long reasoning chains, many tokens in a correct response are “mundane” tokens that carry little information about why the response was correct or incorrect. A tight clip prevents the policy from moving these tokens at all, wasting the learning signal. With $\epsilon = 10$, the policy can make larger updates, allowing the gradient signal from the outcome reward to propagate back effectively across the 10K–30K token reasoning trace.
The risk: instability. The authors validate that the large clip doesn’t cause training instability in practice, likely because the KL regularization and the fact that most tokens still have small ratios provide implicit stability.
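To see what the clip range changes in practice, the sketch below evaluates the clipped surrogate term for a few probability ratios under both settings. The helper names and the example ratios are assumptions chosen for illustration.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float) -> float:
    """Per-sample term: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def is_clipped(ratio: float, advantage: float, eps: float) -> bool:
    """True when the clipped branch is active, i.e. this sample gives no gradient."""
    return clipped_surrogate(ratio, advantage, eps) != ratio * advantage


# A ratio of 1.5 on a positively rewarded sample:
tight = is_clipped(1.5, +1.0, eps=0.2)   # clipped under the usual PPO setting
loose = is_clipped(1.5, +1.0, eps=10.0)  # still contributes gradient with eps = 10
# Only a very large ratio hits the upper bound 1 + eps = 11:
extreme = is_clipped(12.0, +1.0, eps=10.0)
```

One consequence of the arithmetic: with $\epsilon = 10$ the lower bound $1 - \epsilon = -9$ is negative, and a probability ratio is always positive, so only the upper bound of 11 can ever bind.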
Part II: DeepSeek-R1-Zero — Pure RL Without SFT
Setup
- Base model: DeepSeek-V3-Base (671B MoE, 37B active parameters)
- No SFT: Training starts directly from the base checkpoint
- Reward: Rule-based only. For math: is the final answer correct? For code: do test cases pass? For format: is the answer wrapped in <think>...</think><answer>...</answer> tags?
The template is deliberately minimal:
User: {problem}
Assistant: <think> {reasoning process} </think> <answer> {answer} </answer>
No guidance on how to reason — only the structural format.
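A rule-based reward of this kind can be very small. The sketch below is an assumption-laden illustration, not the paper's implementation: the regular expression, the exact-string answer comparison, and the 0/1 reward values are all hypothetical choices.

```python
import re

# Completion must be exactly one <think> block followed by one <answer> block.
_TEMPLATE = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the <think>/<answer> template, else 0.0."""
    return 1.0 if _TEMPLATE.match(completion) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted answer equals the reference string, else 0.0."""
    match = _TEMPLATE.match(completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0


good = "<think>2 + 2 = 4</think> <answer>4</answer>"
bad = "The answer is 4."
```

A real math checker would normalize equivalent answer forms (fractions, LaTeX, units); exact string matching is used here only to keep the sketch short.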
Training Dynamics
Figure 2: Training Trajectory of DeepSeek-R1-Zero
[Two panels plotted against training step, 0 to 10K. Left: AIME 2024 accuracy (0 to 1.0), showing the r1-zero cons@16 curve and a human baseline. Right: average response length (0 to 20K tokens).]
Two striking observations:
- Accuracy climbs steadily from 15.6% to 77.9% pass@1 on AIME 2024 (and to 86.7% with majority voting over 16 samples), surpassing the average human competitor score.
- Response length grows organically from ~3,000 tokens to ~17,000 tokens per response. The model is “buying” more thinking time autonomously: the RL objective never explicitly rewards length.
The “Aha Moment”
Around training step 5,000, the model begins using the word “wait” as a self-correction signal within its <think> block:
<think>
... [initial approach] ...
Wait, wait. Wait. That's an aha moment I can flag here.
Let me reevaluate this step-by-step...
[revised approach]
</think>
<answer> ... </answer>
This wasn’t taught. The model discovered that pausing and re-examining its work leads to higher rewards, and converged on a verbal marker for this. It is a genuine emergent capability — not imitation of human-written reasoning traces, but discovered via RL.
The occurrence of “wait” in reflective contexts (tracked over training steps) shows a sharp phase transition around step 4,000–5,000, which corresponds exactly to the jump in AIME accuracy.
Why Does SFT Hurt?
This is the paper’s most provocative theoretical claim. The argument:
- In SFT, the model is trained to reproduce human reasoning traces.
- Humans have biases: they tend to write reasoning in specific ways, at specific lengths, with specific vocabulary.
- This “constrains the exploration space” of the policy. The model learns to reason in human ways, capped by human quality.
- In pure RL, the model can discover non-human reasoning strategies that are better optimized for the verifiable reward.
The alternative and its failure mode: Why not just do SFT-then-RL? The paper shows this works (DeepSeek-R1 uses exactly this), but the SFT-initialized policy is less free to explore novel patterns. R1-Zero explores more, yet R1 still adds a cold-start SFT step specifically to address R1-Zero’s issues (language mixing, poor readability).
Part III: DeepSeek-R1 — The Full Multi-Stage Pipeline
Pipeline Overview
Figure 3: DeepSeek-R1 Four-Stage Training Pipeline
graph LR
A[DeepSeek-V3-Base] --> B[Stage 1: Cold-Start SFT]
B --> C[DeepSeek-R1-Dev1]
C --> D[Stage 2: RL Stage 1\nreasoning-only rewards]
D --> E[DeepSeek-R1-Dev2]
E --> F[Stage 3: Rejection Sampling\n+ SFT on mixed data]
F --> G[DeepSeek-R1-Dev3]
G --> H[Stage 4: RL Stage 2\ndiversity + preference rewards]
H --> I[DeepSeek-R1]
Stage 1: Cold-Start SFT
The problem with pure RL from base: responses can mix Chinese and English mid-thought, be poorly formatted, and have low readability even when correct. Cold-start SFT addresses this.
Data: “Thousands of” (small dataset) examples of conversational, human-aligned long CoT reasoning. Curated to exhibit:
- Natural thinking process (not just final answers)
- Language consistency
- Proper use of <think> tags
- Summary section after thinking
Effect: Dev1 vs R1-Zero shows big jumps in IF-Eval (instruction following) and Arena-Hard — the model learns to communicate better — but a dip in pure math performance (less free exploration). The cold start anchors the model in human communication patterns at the cost of some RL freedom.
Stage 2: First RL Stage (Reasoning Focus)
Same GRPO setup as R1-Zero, but:
- Initialized from Dev1 (not raw base)
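- Additional reward: Language consistency reward (Eq. 5):
$$r_{\text{lang}} = \frac{\text{Num}(\text{Words}_{\text{target}})}{\text{Num}(\text{Words})} \tag{5}$$
This is the proportion of target-language words in the CoT, so it penalizes mixing Chinese and English within the CoT. It’s added directly to the final reward: $r = r_{\text{rule}} + r_{\text{lang}}$, where $r_{\text{rule}}$ is the rule-based reward.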
The ablation in the supplementary shows this trades ~1–2 points of reasoning accuracy for significantly better readability. The authors accept this tradeoff.
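The language consistency reward is the fraction of the CoT written in the target language. The sketch below is a crude, hypothetical proxy: it counts characters rather than words (Chinese text is not whitespace-delimited), only distinguishes ASCII letters from CJK ideographs, and the function name is an assumption.

```python
def language_consistency_reward(cot: str, target: str = "en") -> float:
    """Fraction of letter-like characters that belong to the target language.

    Crude proxy: ASCII letters count as English, CJK Unified Ideographs
    (U+4E00 to U+9FFF) count as Chinese; everything else is ignored.
    """
    latin = sum(1 for ch in cot if ch.isascii() and ch.isalpha())
    cjk = sum(1 for ch in cot if "\u4e00" <= ch <= "\u9fff")
    total = latin + cjk
    if total == 0:
        return 0.0
    return (latin if target == "en" else cjk) / total


pure = language_consistency_reward("Let x be two", target="en")
mixed = language_consistency_reward("Let x 等于 two", target="en")
```

The mixed-language trace scores lower than the pure one, which is the pressure that trades a little accuracy for readability.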
Training configuration (RL Stage 1):
- LR: 3e-6
- KL coefficient β: 0.001
- Clip ratio ε: 10
- Group size G: 16
- Max sequence length: 32,768 (→ 65,536 after step 8,200)
- Batch: 32 unique questions × 16 outputs = 512 per step
- Reference model refreshed: every 400 steps
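The configuration above can be captured as a small config object, which also makes the batch arithmetic explicit. The class and field names below are hypothetical; only the numeric values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRPOStageConfig:
    learning_rate: float = 3e-6
    kl_coef: float = 0.001            # beta
    clip_eps: float = 10.0            # epsilon
    group_size: int = 16              # G
    questions_per_step: int = 32
    max_len_initial: int = 32_768
    max_len_extended: int = 65_536
    extend_after_step: int = 8_200
    ref_refresh_every: int = 400

    @property
    def outputs_per_step(self) -> int:
        """Number of sampled outputs consumed by one training step."""
        return self.questions_per_step * self.group_size

    def max_len(self, step: int) -> int:
        """Maximum sequence length in effect at a given training step."""
        if step > self.extend_after_step:
            return self.max_len_extended
        return self.max_len_initial

    def refresh_reference(self, step: int) -> bool:
        """Whether the reference model is re-synced after this step."""
        return step > 0 and step % self.ref_refresh_every == 0


cfg = GRPOStageConfig()
```

With these values `cfg.outputs_per_step` is 512, so a rollout of 8,192 outputs covers 16 training steps.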
Stage 3: Rejection Sampling + SFT
After RL Stage 1, the model (Dev2) can produce high-quality reasoning chains. Now sample from Dev2 and filter:
- Generate: For each prompt in the training set, sample multiple responses.
- Filter: Keep only responses where the final answer is verifiable and correct.
- SFT: Fine-tune on the filtered (correct) responses, both reasoning and non-reasoning data.
The non-reasoning data is critical: it teaches writing, question-answering, factual recall, and code engineering — tasks where rule-based verification is impossible.
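The generate-filter-SFT loop for the verifiable portion of the data can be sketched as follows. The function `rejection_sample`, its parameters, and the toy generator and checker are hypothetical stand-ins; in practice `generate` would call the Dev2 policy and `is_correct` would be the rule-based verifier.

```python
import itertools
from typing import Callable, Dict, List


def rejection_sample(prompts: List[str],
                     generate: Callable[[str], str],
                     is_correct: Callable[[str, str], bool],
                     samples_per_prompt: int = 4) -> List[Dict[str, str]]:
    """Sample several responses per prompt and keep only the verified ones."""
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            response = generate(prompt)
            if is_correct(prompt, response):
                kept.append({"prompt": prompt, "response": response})
    return kept


# Toy stand-ins: a "model" that alternates between a right and a wrong answer.
_canned = itertools.cycle(["4", "5"])


def toy_generate(prompt: str) -> str:
    return next(_canned)


def toy_is_correct(prompt: str, response: str) -> bool:
    return response == "4"


sft_data = rejection_sample(["2+2=?"], toy_generate, toy_is_correct)
```

Only the correct half of the samples survives, and the surviving pairs become SFT examples.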
Combined SFT dataset (Dev3):
- High-quality reasoning traces: selected via rejection sampling from Dev2
- Non-reasoning data: re-used from DeepSeek-V3’s SFT pipeline
- Code engineering data: for Aider-Polyglot performance
Effect on Dev3 vs Dev2: +7 points on AlpacaEval 2.0, +19 points on Aider-Polyglot. General intelligence improves significantly; math/code is mostly preserved.
Stage 4: Second RL Stage (Diversity + Preference)
Final RL stage on Dev3. Two key changes from Stage 2:
- Diverse data: Mix reasoning prompts with general instruction prompts.
- Mixed rewards: Rule-based reward for reasoning; reward model for general data.
The reward model itself is trained separately:
Helpful RM: 66,000 preference pairs. DeepSeek-V3 is prompted to generate two candidate responses for each query. They are scored four times with A/B order randomized to reduce positional bias. Pairs whose score difference $\Delta$ is too small are discarded for quality. The RM uses the DeepSeek-R1 architecture with a scalar reward head.
Safety RM: 106,000 prompts with binary safe/unsafe labels. Pointwise classification (unlike pairwise helpful RM). The safety RM evaluates the entire response including the reasoning trace.
Stage 4 configuration:
- LR: same as Stage 2
- Temperature: 0.7 (reduced from 1.0 — higher temperatures cause incoherent generation at this stage)
- Steps: 1,700 total; preference-based rewards added only in last 400 steps
- Observation: more steps with model-based preference rewards → reward hacking; capped at 400 steps to prevent this
Why reduce temperature? At this stage the model already has strong priors from Stage 3. High temperature leads to incoherent responses, not creative exploration. Exploration is no longer needed — exploitation and alignment are the goals.
Part IV: Distillation to Smaller Models
Method
The stronger reasoning capabilities of DeepSeek-R1 can be transferred to smaller models through knowledge distillation, but not in the traditional sense of matching the teacher’s output distributions or intermediate representations. Instead, the authors use SFT on model-generated data:
Figure 4: Distillation Pipeline
DeepSeek-R1 (671B)
|
| Generate 800K long-CoT reasoning traces
| (mathematical, code, science problems)
↓
Filter: keep only correct answers
↓
SFT on filtered data
↓
DeepSeek-R1-Distill-{Qwen, Llama}-{1.5B, 7B, 8B, 14B, 32B, 70B}
The data: about 600K reasoning samples and 200K non-reasoning samples (800K in total), with R1 generating the trajectories.
Pseudocode: Distillation Training Loop
For each base model M ∈ {Qwen-1.5B, Qwen-7B, Qwen-14B, Qwen-32B, Llama-8B, Llama-70B}:
    Initialize policy π = M
    For each (q, o*) in D_distill, where o* is R1's accepted reasoning trace:
        Compute loss = CrossEntropy(π(·|q), o*)   # SFT loss: token-level NLL of o* given q
        Update π by gradient descent
    Evaluate on AIME, MATH-500, LiveCodeBench
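The SFT loss in the loop above is ordinary token-level cross-entropy against the teacher's trace. As a minimal sketch, assume we already have the probability the student assigns to each token of the trace; the function name and the probabilities are illustrative.

```python
import math
from typing import List


def sft_loss(target_token_probs: List[float]) -> float:
    """Mean negative log-likelihood of the teacher's tokens under the student."""
    total = sum(math.log(p) for p in target_token_probs)
    return -total / len(target_token_probs)


# Student probabilities for each token of a (very short) teacher trace.
confident = sft_loss([0.9, 0.8, 0.95])  # student already imitates the trace well
unsure = sft_loss([0.2, 0.1, 0.3])      # student finds the trace surprising
```

Every token of every accepted trace contributes a gradient, which is the "dense supervision" that sparse outcome rewards lack.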
Key finding: Direct SFT on R1-generated traces substantially outperforms applying RL directly to small models. The paper also shows that distilling from DeepSeek-R1 outperforms distilling from DeepSeek-V3 for reasoning tasks — confirming that R1’s traces encode better reasoning patterns.
Distillation Results
Figure 5: Distilled Model Performance vs Size
| Model | AIME 2024 | MATH-500 | LiveCodeBench |
|---|---|---|---|
| R1-Distill-Qwen-1.5B | 28.9% | 83.9% | 34.1% |
| R1-Distill-Qwen-7B | 55.5% | 92.8% | 54.9% |
| R1-Distill-Qwen-14B | 69.7% | 93.9% | 64.7% |
| R1-Distill-Qwen-32B | 72.6% | 94.3% | 69.4% |
| R1-Distill-Llama-8B | 50.4% | 89.1% | 48.9% |
| R1-Distill-Llama-70B | 70.0% | 94.5% | 65.7% |
| DeepSeek-R1 (671B) | 79.8% | 97.3% | 65.9% |
| OpenAI-o1-mini | 63.6% | 90.0% | 53.8% |
| OpenAI-o1 | 74.3% | 96.4% | 63.4% |
The 1.5B model already matches QwQ-32B (a specialized 32B reasoning model) on several tasks. The 70B model is competitive with OpenAI o1 at a small fraction of the parameter count.
Why Distillation Works Better Than Small-Model RL
Directly training a 7B model with GRPO from scratch produces much worse results than SFT on R1’s traces. Why?
- Capacity: Small models may lack the parameter budget to discover novel reasoning strategies from scratch. Large models have already done the exploration.
- Training signal density: A 7B model solving AIME problems will succeed very rarely early in training, giving almost no positive reward signal for RL to work with. R1’s traces provide dense supervision.
- Long CoT requirements: Generating coherent 10,000-token reasoning chains requires a certain base capability. Small models can’t do this reliably without guidance.
Part V: Experimental Results and Analysis
Main Benchmarks
Figure 6: DeepSeek-R1 vs State-of-the-Art (select benchmarks)
| Benchmark | R1-Zero | R1 | o1-0912 | o1-mini | GPT-4o |
|---|---|---|---|---|---|
| AIME 2024 (Pass@1) | 77.9 | 79.8 | 74.3 | 63.6 | 9.3 |
| MATH-500 (Pass@1) | 95.9 | 97.3 | 96.4 | 90.0 | 76.6 |
| GPQA Diamond | 75.8 | 71.5 | 77.3 | 60.0 | 53.6 |
| LiveCodeBench | 50.0 | 65.9 | 63.4 | 53.8 | 33.4 |
| Codeforces (Rating) | 1444 | 2029 | 1891 | 1820 | 759 |
| MMLU | 88.8 | 90.8 | 92.3 | 85.2 | 87.2 |
| AlpacaEval2 (LC) | 24.7 | 87.6 | — | — | 57.5 |
Key observations:
- R1-Zero is already impressive but suffers on general tasks (AlpacaEval 24.7%).
- R1’s multi-stage pipeline fixes general task performance (87.6% AlpacaEval) without losing math/code.
- R1 on Codeforces achieves ELO 2029 — top 3% of human competitors.
Ablation: Why Not Skip the Cold Start?
The paper provides an implicit ablation: compare R1-Zero (no cold start, pure RL) with R1-Dev1 (cold start SFT, then RL). The results:
- Dev1 vs R1-Zero on IF-Eval: 71.7% vs 46.6% — huge win for cold start on instruction following.
- Dev1 vs R1-Zero on AIME 2024: 59.0% vs 77.9% — R1-Zero wins on pure math.
Interpretation: cold start SFT trades exploration freedom (slightly worse math) for communication quality (much better instruction following). The subsequent RL stages in R1 recover the math performance.
Ablation: Does the Language Consistency Reward Help?
Supplementary B.6 shows: removing the language consistency reward gives +1-2 points on math benchmarks but produces significantly mixed-language outputs. The tradeoff is explicitly acknowledged and the authors accept the performance cost for usability.
Test-Time Compute Scaling
A key advantage of chain-of-thought reasoning models: you can trade inference compute for accuracy. DeepSeek-R1 supports:
- Majority voting (cons@N): Generate N responses, take the majority answer. Scaling from N=1 to N=16 improves AIME 2024 from 79.8% to 87.2%.
- Dynamic length allocation: Unlike MCTS or beam search, R1 naturally allocates more tokens to harder problems within a single generation, without external compute allocation.
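Majority voting is simple to implement once final answers have been extracted from each sampled response. A minimal sketch, with a hypothetical function name and made-up answers:

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent final answer among N sampled responses.

    Ties are broken in favor of the answer that appeared first.
    """
    return Counter(answers).most_common(1)[0][0]


sampled_answers = ["42", "41", "42", "7", "42", "41"]
consensus = majority_vote(sampled_answers)
```

The cost is N full generations per question, so cons@16 trades roughly 16× the inference compute for the accuracy gain quoted above.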
Part VI: Design Choices, Alternatives, and Boundaries
Choice 1: Rule-Based Reward vs Neural Reward for Reasoning
What they did: For math, code, and logic problems, they use rule-based rewards (check the answer, compile the code). No neural reward model for these tasks.
Why: Neural reward models are susceptible to reward hacking during large-scale RL. As the policy diverges from the SFT model, it finds inputs that fool the reward model without being genuinely correct. For math, the ground truth answer is unambiguous — rule-based verification is 100% reliable.
What would happen if you used a neural reward? The policy would eventually learn to produce responses that score well on the neural RM but may not actually be correct. This is documented in Supplementary B.5 for the Stage 4 RM: more than 400 steps of preference RM training leads to reward hacking.
Boundary condition: This only works because math/code has verifiable answers. For writing, instruction following, open-ended QA — where there’s no ground truth — neural reward models are unavoidable. This is why Stage 4 uses neural RMs only for general data, limited to 400 steps.
Choice 2: GRPO Instead of PPO
What they did: Eliminate the value model, compute advantages from group statistics.
Why: Memory/compute savings are significant at 671B scale — avoiding a 671B value model. Also, for long reasoning chains, the value function is hard to learn accurately.
What would PPO do? From Figure 4 (comparison), PPO with the default GAE $\lambda$ performs significantly worse than GRPO. With careful tuning of $\lambda$, PPO matches GRPO but requires extra hyperparameter search. At scale, the memory overhead of a value model makes PPO impractical.
Boundary: GRPO requires that multiple samples from the same question are meaningful — that rewards are comparable across samples from the same distribution. For tasks where all outputs reliably score near 0 or 1 (very easy or very hard problems), the group variance is near 0, providing no gradient signal. Problem selection (challenging but not impossible) matters enormously.
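The boundary can be quantified under a simple assumption: rewards are binary and each output is correct independently with probability `p`. A group only provides signal when it contains both a success and a failure, since otherwise every $r_i$ equals the group mean and every advantage is zero. The function name below is hypothetical.

```python
def p_mixed_group(p: float, group_size: int) -> float:
    """Probability that a group of binary rewards has at least one success
    and at least one failure (the only case with non-zero advantages)."""
    return 1.0 - p ** group_size - (1.0 - p) ** group_size


too_hard = p_mixed_group(0.01, 16)  # model almost never solves the problem
balanced = p_mixed_group(0.50, 16)  # challenging but solvable
too_easy = p_mixed_group(0.99, 16)  # model almost always solves it
```

Under these assumptions only about 15% of groups carry any gradient when the success rate is 1% or 99%, versus nearly all groups at 50%.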
Choice 3: Multi-Stage Pipeline Instead of End-to-End RL
What they did: Four distinct training stages with different data, rewards, and objectives.
Why: End-to-end RL from base would likely fail because (a) language mixing would make outputs unreadable, (b) the policy would over-specialize in reasoning domains and neglect general tasks, and (c) the base model needs some anchoring in human communication styles before RL can effectively explore.
Alternative: OpenAI o1 reportedly uses a similar staged approach but details are not public. Pure end-to-end RL (R1-Zero) achieves strong reasoning but poor general performance.
Boundary: The multi-stage pipeline is more complex and each stage introduces hyperparameters. The authors mention reward hacking concerns at Stage 4 — this requires careful monitoring. This pipeline is not easily reproducible without significant infrastructure.
Choice 4: Distillation via SFT Rather Than RL Transfer
What they did: Generate 800K traces from R1, filter for correctness, SFT smaller models on these traces.
Why: Direct RL on small models fails to learn because the reward signal is too sparse (small models rarely get hard problems right initially). SFT on correct traces provides dense supervision.
What if they did RL on smaller models instead? The paper shows in Supplementary F that applying GRPO directly to a 7B model with math data produces much weaker results than SFT on R1’s traces. The key insight: capability transfer is more efficient than independent discovery.
Boundary: The distilled models inherit R1’s reasoning style, including its verbosity. For latency-sensitive applications, these models may be too slow. The distilled models also can’t improve beyond what R1 demonstrates.
Part VII: Limitations and Open Problems
The paper is admirably honest about limitations:
- Structured output and tool use: R1 cannot yet call external tools reliably during reasoning. This is a major gap compared to agentic systems.
- Token efficiency / “overthinking”: R1 sometimes uses many more tokens than necessary even for simple problems. The length growth is driven by the RL objective (correct answers get full reward regardless of length), not by genuine problem complexity.
- Language mixing: The model is optimized for Chinese and English. Other languages trigger mixing issues.
- Prompt sensitivity: Few-shot prompting degrades performance. R1 is designed for zero-shot use.
- Software engineering tasks: Long evaluation times (running code test suites) made it impractical to apply RL extensively to software engineering. This is a systems problem, not a modeling problem.
- Reward hacking: The pipeline requires careful stage-wise management. The preference RM at Stage 4 can be hacked if trained too long. Rule-based rewards break down for open-ended tasks.
- Safety: The model’s safety level is “moderate” compared to GPT-4o without a risk control layer. Enhanced reasoning capability can make unsafe responses more operationally feasible.
Part VIII: Infrastructure and Systems Perspective
This is briefly covered in Supplementary B.1. The RL infrastructure uses:
- vLLM for rollout generation (efficient batched inference for the policy model’s sample generation)
- Overlapped execution: Rollout (inference), reward computation (code execution, answer matching, format checking), and training are pipelined to avoid idle GPU time.
- Multi-node distributed training: The 671B model is distributed across many nodes with tensor parallelism and pipeline parallelism.
- Reference model management: The reference model is updated every 400 steps, not kept frozen for the entire run. This allows the KL constraint to adapt as the policy improves, rather than constraining it to an increasingly irrelevant reference.
Figure 7: RL Training Infrastructure Dataflow
graph TD
A[vLLM Workers\nRollout Generation] --> B[Reward Computation\nAnswer Matcher / Code Executor / Format Checker]
B --> C[Advantage Computation\nGroup Normalize rᵢ → Aᵢ]
C --> D[Actor Model Training\nGRPO Loss + KL Term]
D -->|Every 400 steps: sync reference| E[Reference Model Update]
A -->|Pack data for training| D
The key insight from the infrastructure: rollout is the bottleneck, not training. vLLM is chosen precisely because it can generate 8,192 outputs per rollout efficiently using PagedAttention.
Part IX: Why This Paper Matters
DeepSeek-R1’s significance goes beyond the benchmark numbers:
- Proof of concept for RL-driven reasoning: Before this, it was unclear whether RL alone could produce the sophisticated chain-of-thought behaviors seen in o1. R1-Zero confirms it unambiguously.
- Open source at frontier capability: The model weights and (importantly) the training recipe are public. This enables the research community to study and extend RL-for-reasoning at scale.
- Distillation as a capability transfer mechanism: The distilled small models (1.5B–70B) are competitive with much larger non-reasoning models. This opens the possibility of deploying strong reasoning capability at low inference cost.
- GRPO as a practical PPO alternative: The algorithm is simple, memory-efficient, and effective. It has been widely adopted in subsequent work (DAPO, Dr. GRPO, etc.).
- Framework for verifiable tasks: The paper articulates clearly when RL works well (verifiable rewards, sufficient model capacity, challenging but not impossible problems) and when it doesn’t (open-ended tasks, small models, overly easy problems). This is a practical roadmap.
Reproducing Key Results
What would it take to reproduce DeepSeek-R1?
- Base model: DeepSeek-V3-Base is open-source on HuggingFace (671B parameters, which is roughly 1.3 TB at 2 bytes per parameter in bf16).
- RL infrastructure: vLLM + custom GRPO training loop. OpenRLHF, verl, and TRL all provide GRPO implementations now.
- Compute: The full run uses several thousand GPU-days on H800s. Stage 2 RL alone runs 10,400 steps with 512 samples per step × 30K tokens = ~1.5 × 10^7 tokens processed per step.
- Reward functions: The math reward (checking final answer) and code reward (compiling + test case evaluation) are straightforward. Math datasets (MATH, AMC, AIME, Olympiad) are public.
- Cold-start data: “Thousands of” examples — the paper doesn’t release this exact data but community reproductions (DeepScaleR, STILL-3, etc.) have built similar datasets.
Several community reproductions (NovaSky’s Sky-T1, Eurus-2, etc.) have partially reproduced DeepSeek-R1’s performance on smaller models with smaller compute budgets, validating the core training recipe.
Summary
DeepSeek-R1 advances the field on three levels simultaneously:
- Algorithm: GRPO — a simpler, more memory-efficient alternative to PPO that works well for long-CoT RL
- Training recipe: A 4-stage pipeline that combines cold-start SFT, staged RL, and rejection-sampling SFT to produce a general-purpose reasoning model
- Systems insight: Distillation via SFT on model-generated traces is more efficient than RL on small models for capability transfer
The deepest contribution is the demonstration that reward-based RL can discover genuinely novel reasoning strategies — not mimicry of human demonstrations but behaviors that emerge from optimization pressure. The “aha moment” phenomenon and the organic growth of response length are empirical evidence that RL is doing something new, not just refining what SFT already taught.
The paper also sets honest boundaries: reward hacking, open-ended tasks, multilingual support, and prompt sensitivity are all identified as limitations. These are the roadmap for R2.
Appendix A: Detailed Mathematical Derivations
A.1 Deriving the GRPO Advantage Estimator
Let me derive why group normalization is an unbiased advantage estimator.
In standard RL, the advantage is defined as:
$$A(s, a) = Q(s, a) - V(s)$$
where $Q(s, a)$ is the action-value function (expected return from taking action $a$ in state $s$) and $V(s)$ is the state-value function (baseline).
In the LLM setting, the “state” is the question and partial response, and the “action” is the entire output (since reward is only observed at the end). So:
$$A(q, o) = r(q, o) - b(q)$$
where $b(q)$ is a baseline that depends only on the question (not the specific output). We need $b(q)$ to satisfy $\mathbb{E}[b(q)] = \mathbb{E}_{o \sim \pi_\theta}[r(q, o)]$ so that the advantage is zero-mean in expectation.
GRPO’s choice: $b(q) = \frac{1}{G}\sum_{j=1}^{G} r_j$, the empirical mean within the group.
This is unbiased because:
$$\mathbb{E}\left[\frac{1}{G}\sum_{j=1}^{G} r_j\right] = \frac{1}{G}\sum_{j=1}^{G} \mathbb{E}[r_j] = \mu(q)$$
where $\mu(q) = \mathbb{E}_{o \sim \pi_\theta}[r(q, o)]$ is the expected reward for question $q$.
The division by $\text{std}(\{r_j\})$ is variance normalization: it doesn’t change the direction of the gradient but standardizes the scale, making the effective learning rate consistent across different problems.
Variance analysis: The group mean estimator has variance $\sigma_r^2 / G$, where $\sigma_r^2$ is the variance of a single reward for this question. With $G = 16$, this reduces variance by 16× compared to using a single sample, making GRPO significantly more stable than naive REINFORCE.
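The two claims about the group-mean baseline (its expectation is $\mu(q)$ and its variance is $\sigma_r^2 / G$) can be checked numerically. The sketch below is a Monte Carlo simulation under an assumed Bernoulli reward with success probability 0.3; the function name, seed, and trial count are arbitrary choices.

```python
import random
from statistics import mean, pvariance
from typing import List


def simulate_group_means(p: float, group_size: int,
                         trials: int, seed: int = 0) -> List[float]:
    """Draw `trials` groups of Bernoulli(p) rewards and return each group's mean."""
    rng = random.Random(seed)
    baselines = []
    for _ in range(trials):
        group = [1.0 if rng.random() < p else 0.0 for _ in range(group_size)]
        baselines.append(mean(group))
    return baselines


p, G = 0.3, 16
baselines = simulate_group_means(p, G, trials=5000)
empirical_mean = mean(baselines)           # should be close to p
empirical_var = pvariance(baselines)       # should be close to p * (1 - p) / G
theoretical_var = p * (1.0 - p) / G
```

The empirical variance lands near `theoretical_var`, which is one sixteenth of the single-sample variance `p * (1 - p)`.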
A.2 The KL Divergence Estimator’s Non-Negativity Proof
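GRPO uses the estimator:
$$\hat{D}_{\text{KL}} = \frac{\pi_{\text{ref}}(o \mid q)}{\pi_\theta(o \mid q)} - \log \frac{\pi_{\text{ref}}(o \mid q)}{\pi_\theta(o \mid q)} - 1$$
Let $x = \pi_{\text{ref}}(o \mid q) / \pi_\theta(o \mid q) > 0$. Then $\hat{D}_{\text{KL}} = x - \log x - 1$.
To show this is $\ge 0$: let $f(x) = x - \log x - 1$. Then $f'(x) = 1 - 1/x$, so $f'(x) = 0 \iff x = 1$.
$f''(x) = 1/x^2 > 0$ for all $x > 0$, so $x = 1$ is a global minimum. $f(1) = 1 - 0 - 1 = 0$.
Therefore $f(x) \ge 0$, with equality iff $x = 1$. ∎
Why this estimator, not the standard $\log(\pi_\theta / \pi_{\text{ref}})$?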
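The naive estimator $\log\frac{\pi_\theta}{\pi_{\text{ref}}}$, evaluated at samples from $\pi_\theta$, is also unbiased, but it can be negative on individual samples and has high variance. The Schulman estimator has the same expectation, is non-negative on every sample, and only requires the ratio $\pi_{\text{ref}} / \pi_\theta$ evaluated at samples from $\pi_\theta$, which is cheap.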
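For a small vocabulary the expectation can be computed exactly, which gives a direct check that the estimator averages to the true KL while staying non-negative per sample. The two distributions below are made-up examples and the function names are assumptions.

```python
import math
from typing import List


def exact_kl(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def k3_terms(p: List[float], q: List[float]) -> List[float]:
    """Per-outcome value of x - log(x) - 1 with x = q / p."""
    return [qi / pi - math.log(qi / pi) - 1.0 for pi, qi in zip(p, q)]


def naive_terms(p: List[float], q: List[float]) -> List[float]:
    """Per-outcome value of the naive estimator log(p / q)."""
    return [math.log(pi / qi) for pi, qi in zip(p, q)]


policy = [0.7, 0.2, 0.1]      # plays the role of pi_theta
reference = [0.5, 0.3, 0.2]   # plays the role of pi_ref

true_kl = exact_kl(policy, reference)
expected_k3 = sum(pi * t for pi, t in zip(policy, k3_terms(policy, reference)))
expected_naive = sum(pi * t for pi, t in zip(policy, naive_terms(policy, reference)))
```

Both expectations equal `true_kl`, but the naive per-outcome terms go negative wherever the policy assigns less probability than the reference, whereas the `k3_terms` never do.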
A.3 PPO vs GRPO: The Value Function Problem for Long CoT
Why is GAE particularly bad for long reasoning chains? Let’s trace through an example.
Consider a math problem. The correct solution requires:
- Tokens 1–500: Setting up the problem correctly
- Tokens 500–2,000: Attempting a first approach
- Tokens 2,000–2,500: Realizing the approach is wrong (“Wait…”)
- Tokens 2,500–8,000: Correct approach leading to answer
The value function $V_\phi(s_1)$ at token 1 must predict the expected reward at token 8,000. But the reward depends on whether the model will eventually “realize” its mistake at token 2,000 and recover. Early in training, this prediction is close to random.
GAE requires:
$$\hat{A}_t = \sum_{l=0}^{T-t} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$
For $t = 1$ (first token), this requires predicting whether 8,000 tokens later the model will have gotten the answer right. The gradient signal for training $V_\phi$ at early positions is extremely noisy.
GRPO bypasses this entirely — the group mean gives a direct signal about whether this particular question tends to be solved correctly, without any position-specific prediction.
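The dependence on the value function is easy to see in code. The sketch below implements the standard GAE backward recursion for a single trajectory with one terminal reward; it is a generic illustration rather than the paper's implementation, and the two "critics" are hypothetical lists of value predictions.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards[t] is the reward received after token t, values[t] is the
    critic's prediction V(s_t); the value after the last token is 0.
    """
    horizon = len(rewards)
    advantages = [0.0] * horizon
    running = 0.0
    for t in reversed(range(horizon)):
        next_value = values[t + 1] if t + 1 < horizon else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    # Five "tokens", reward of 1 only at the very end.
    rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
    accurate_critic = [1.0] * 5                    # already predicts success
    untrained_critic = [0.2, 0.9, 0.1, 0.7, 0.4]   # arbitrary predictions
    print(gae_advantages(rewards, accurate_critic))
    print(gae_advantages(rewards, untrained_critic))
```

The same trajectory yields very different per-token advantages depending on the critic. With a group-mean baseline there is no critic at all: every token of an output shares the single advantage computed from that output's reward relative to its group.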
Appendix B: GRPO Algorithm Full Pseudocode
Algorithm: GRPO Training for DeepSeek-R1-Zero
Hyperparameters:
    G = 16              # group size (outputs sampled per question)
    ε = 10              # clip ratio
    β = 0.001           # KL coefficient
    lr = 3e-6           # learning rate
    T_ref = 400         # reference model refresh interval
    T_max = 10400       # total training steps
    n_rollout = 8192    # outputs generated per rollout (n_rollout / G = 512 questions)
    n_minibatch = 16    # mini-batches per rollout
Initialize:
    π_θ ← DeepSeek-V3-Base
    π_ref ← DeepSeek-V3-Base (frozen initially)
    π_θ_old ← copy of π_θ
For step t = 1 to T_max:
    # === ROLLOUT PHASE ===
    Sample n_rollout / G questions Q = {q_1, ..., q_{n_rollout/G}} from training set
    For each q in Q:
        Sample G outputs {o_1,...,o_G} from π_θ_old(·|q) with temperature=1
        Score each: r_i = reward(o_i, q)   # rule-based: check answer + format
    Pack all (q, o_i, r_i) into dataset D_rollout
    Randomly split D_rollout into n_minibatch mini-batches
    # === TRAINING PHASE (single inner epoch) ===
    For each mini-batch B ⊆ D_rollout:
        For each (q, {(o_i, r_i)}) in B:
            # Compute group-normalized advantages
            r_mean = mean({r_i : i=1,...,G})
            r_std = std({r_i : i=1,...,G}) + 1e-8   # epsilon for stability
            A_i = (r_i - r_mean) / r_std for all i
            # Compute GRPO loss
            L_GRPO = 0
            For i = 1 to G:
                ratio_i = π_θ(o_i|q) / π_θ_old(o_i|q)
                clipped = clip(ratio_i, 1-ε, 1+ε)
                L_GRPO += min(ratio_i * A_i, clipped * A_i)
            L_GRPO /= G
            # Compute KL penalty
            KL_i = π_ref(o_i|q)/π_θ(o_i|q) - log(π_ref(o_i|q)/π_θ(o_i|q)) - 1
            L_KL = mean(KL_i for i=1,...,G)
            # Total loss (negative because we maximize)
            loss = -(L_GRPO - β * L_KL)
        Compute gradients, update π_θ
    # === REFERENCE MODEL UPDATE ===
    If t % T_ref == 0:
        π_ref ← copy of π_θ   # refresh reference to prevent KL from over-constraining
    # Update π_θ_old for next rollout
    π_θ_old ← copy of π_θ
Notes on the pseudocode:
- Line “If t % T_ref == 0: π_ref ← copy of π_θ” is crucial. Without this, after 10,000 steps the policy has drifted so far from the initial base model that the KL constraint is irrelevant (the KL is always huge). Refreshing every 400 steps keeps the KL penalty meaningful.
- The `+ 1e-8` in the std computation prevents division by zero when all G outputs receive the same reward (e.g., all correct or all incorrect).
- The single inner epoch avoids overfitting on the rollout batch.
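The inner loss computation for a single group can also be written as a few lines of plain Python. This is a minimal numeric sketch under simplifying assumptions: it treats each output's log-probability as one sequence-level float (as the pseudocode does), computes only the loss value with no gradients, and uses a hypothetical function name. Practical implementations typically work per token inside an autodiff framework.

```python
import math


def grpo_group_loss(rewards, logp_new, logp_old, logp_ref,
                    clip_eps=10.0, beta=0.001, std_eps=1e-8):
    """Loss for one question's group of G outputs (to be minimized).

    rewards[i]  : scalar reward of output i
    logp_new[i] : log pi_theta(o_i | q)
    logp_old[i] : log pi_theta_old(o_i | q)
    logp_ref[i] : log pi_ref(o_i | q)
    """
    group_size = len(rewards)
    mean = sum(rewards) / group_size
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / group_size)

    surrogate = 0.0
    kl = 0.0
    for r, lp_new, lp_old, lp_ref in zip(rewards, logp_new, logp_old, logp_ref):
        advantage = (r - mean) / (std + std_eps)
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate += min(ratio * advantage, clipped * advantage)
        # KL estimator x - log(x) - 1 with x = pi_ref / pi_theta
        log_x = lp_ref - lp_new
        kl += math.expm1(log_x) - log_x

    # Negated because the objective is maximized.
    return -(surrogate / group_size - beta * kl / group_size)


if __name__ == "__main__":
    rewards = [1.0, 0.0]
    old = [-1.0, -1.0]
    unchanged = grpo_group_loss(rewards, old, old, old)
    # Raise the log-prob of the rewarded output, lower the other one.
    improved = grpo_group_loss(rewards, [-0.9, -1.1], old, old)
    print(unchanged, improved)
```

Shifting probability toward the rewarded output lowers the loss, while a group whose rewards are all equal contributes only the KL penalty term.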
Appendix C: Related Work Context
C.1 How This Differs from InstructGPT / RLHF
Classic RLHF (Ouyang et al., 2022 — InstructGPT):
- SFT on human demonstrations
- Train RM on human preference comparisons
- PPO to maximize RM score
Key differences in DeepSeek-R1:
- No human preference comparisons: The reward is rule-based (is the answer correct?), not trained from human feedback.
- No SFT before RL (R1-Zero): Classic RLHF always starts with SFT; R1-Zero skips this.
- Outcome reward, not process reward: The RL signal comes only from the final answer correctness, not from intermediate steps.
- Scale: Classic RLHF operated on models up to ~175B; R1 operates at 671B with a much longer context window.
C.2 Process Reward Models (PRMs) — The Road Not Taken
An alternative to outcome rewards is a process reward model (PRM) that scores each reasoning step. OpenAI’s “Let’s Verify Step by Step” (Lightman et al., 2023) showed that step-level feedback outperforms outcome-level supervision when training reward models on the MATH benchmark.
Why didn’t DeepSeek-R1 use PRMs?
- PRMs require human annotation at the step level, which is expensive and hard to scale.
- PRMs can be “fooled” by correct-looking but incorrect intermediate steps.
- For the specific problem domains (competition math, code), outcome verification is cheap and reliable, making PRMs unnecessary.
The paper’s insight: for tasks with verifiable outcomes, outcome-level rewards are sufficient to develop sophisticated multi-step reasoning. You don’t need to tell the model which intermediate steps are correct.
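To illustrate how little machinery an outcome-level reward needs, here is a minimal sketch of a rule-based scorer. Everything specific in it is an assumption made for the example: the `<think>`/`<answer>` tag layout, the exact-string answer comparison, the equal weighting of the two components, and the function names. A real math checker would need a more careful equivalence test.

```python
import re

# Assumed output layout: reasoning in <think> tags, final answer in <answer> tags.
FORMAT_PATTERN = re.compile(
    r"<think>.+?</think>\s*<answer>.+?</answer>", re.DOTALL
)
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(output):
    """1.0 if the whole output follows the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_PATTERN.fullmatch(output.strip()) else 0.0


def accuracy_reward(output, reference):
    """1.0 if the extracted final answer equals the reference string."""
    match = ANSWER_PATTERN.search(output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0


def outcome_reward(output, reference):
    """Sum of accuracy and format rewards (equal weights are an assumption)."""
    return accuracy_reward(output, reference) + format_reward(output)


if __name__ == "__main__":
    good = "<think>2 + 2 = 4</think> <answer>4</answer>"
    wrong = "<think>2 + 2 = 5</think> <answer>5</answer>"
    unformatted = "The answer is 4."
    for text in (good, wrong, unformatted):
        print(outcome_reward(text, "4"))
```

Nothing here inspects the contents of the reasoning block: the scorer only checks that it exists and that the final answer matches, which is exactly the sense in which the reward is outcome-level.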
C.3 Monte Carlo Tree Search (MCTS) — Another Road Not Taken
Some prior work (AlphaCode 2, various reasoning papers) used MCTS to perform tree search over reasoning paths at test time. DeepSeek-R1 explicitly mentions MCTS as a comparison point for “test-time compute scaling.”
Why not MCTS?
- MCTS requires a learned value function — the same problem as PPO.
- MCTS is not end-to-end differentiable; it requires a separate inference-time search procedure.
- R1’s approach (generate a long CoT in a single pass) is simpler to implement and deploy.
The paper claims R1’s “dynamic length allocation” within a single pass is competitive with or better than MCTS for the benchmark tasks evaluated, though a comprehensive comparison is not provided.
C.4 Concurrent Work: OpenAI o1
The paper acknowledges OpenAI o1 as the direct comparison point. However, o1’s training details are not public. Based on available information:
- o1 likely uses a multi-stage pipeline similar to DeepSeek-R1
- o1 reportedly uses process reward models in training (unconfirmed)
- o1 is closed-source; DeepSeek-R1 is fully open-source
The significance of DeepSeek-R1 is not just matching o1’s performance but providing the training recipe so the research community can build on it.
Appendix D: Figures Reference Guide
This review contains the following embedded diagrams:
- Figure 1: GRPO vs PPO architecture comparison (ASCII)
- Figure 2: R1-Zero training trajectory — accuracy and response length curves (ASCII)
- Figure 3: DeepSeek-R1 four-stage training pipeline (Mermaid)
- Figure 4: Distillation pipeline flowchart (ASCII)
- Figure 5: Distilled model performance table (Markdown)
- Figure 6: Main benchmark comparison table (Markdown)
- Figure 7: RL training infrastructure dataflow (Mermaid)
The paper’s key experimental figures (Figure 1 in the paper: the AIME accuracy curve and the response length curve) show the most important empirical result: reasoning accuracy and response length both rise steadily during RL training, which is consistent with genuine capability emergence rather than rote memorization.