May 23, 2026 EN #Reinforcement Learning #Reasoning #LLM Training

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Review date: 2026-05-23
Review author: Zhongzhu Zhou
Paper reviewed: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Paper authors: DeepSeek-AI (Core: Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu et al.)
arXiv: 2501.12948
Status/Venue: arXiv preprint (January 2025 / January 2026 v2), fully open-sourced at HuggingFace

Short Answer

DeepSeek-R1 demonstrates that a language model can develop sophisticated multi-step reasoning — including self-reflection, verification, and exploration of alternative approaches — purely through reinforcement learning against outcome-based rewards, with no human-annotated reasoning trajectories. The model matches or exceeds OpenAI-o1 on a wide range of benchmarks. This matters not just for the result but for the mechanism: it shows that RL is a genuine path to capability, not just alignment.

Prerequisites: What You Need to Know First

Before diving in, let me lay out the background concepts you’ll need to follow the technical argument.

1. Language Model Post-Training

A raw pre-trained LLM predicts next tokens; it is not yet a useful assistant. Post-training refers to the suite of techniques applied after pre-training to produce a helpful, harmless model. The classical recipe is:

Supervised Fine-Tuning (SFT): Train the model on (prompt, ideal response) pairs curated by humans or high-quality models. The model learns to mimic the style and content of the training corpus.
Reinforcement Learning from Human Feedback (RLHF): A separate reward model is trained to predict human preference between two responses. The LLM policy is then updated by PPO to maximize this reward. This is how InstructGPT, GPT-4, and Claude were aligned.

The key limitation: SFT caps model quality at human-demonstration quality. If the human annotators reason in a certain way, the model imitates that reasoning style — it can’t discover better strategies on its own.

2. Chain-of-Thought (CoT) Prompting

Chain-of-thought prompting (Wei et al., 2022) asks the model to produce intermediate reasoning steps before giving the final answer. This dramatically improves performance on math, logic, and science problems. The key insight is that long outputs can represent computation: the model can “think on paper.”

\text{Problem} \xrightarrow{\text{CoT}} \underbrace{\text{step}_1 \to \text{step}_2 \to \cdots \to \text{step}_k}_{\text{reasoning trace}} \to \text{Answer}

OpenAI’s o1 (2024) showed that making this explicit at training time (not just inference time) — teaching models to generate long internal monologues — dramatically boosts performance on hard math and coding benchmarks.

3. Proximal Policy Optimization (PPO)

PPO (Schulman et al., 2017) is the standard RL algorithm used for RLHF. Recall the basic RL setup: a policy $\pi_\theta$ selects actions (tokens) in states (partial sequences) to maximize expected reward. PPO’s objective:

\mathcal{L}^{\text{PPO}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t,\ \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)\, \hat{A}_t \right) \right]

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}$ is the probability ratio and $\hat{A}_t$ is the advantage — how much better this action is relative to baseline.

The advantage requires a value model $V_\phi(s_t)$ that predicts expected future reward from state $s_t$ . This value model is usually as large as the policy model, doubling memory consumption.

\hat{A}_t^{\text{GAE}} = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)

Generalized Advantage Estimation (GAE, Schulman et al., 2015) with discount $\gamma$ and smoothing $\lambda$ is the standard way to reduce variance in this estimate. Here $T$ is the episode length, so the sum runs over the remaining steps of the episode. In practice, $\lambda$ is notoriously sensitive to tune.
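To make the recursion concrete, here is a minimal pure-Python sketch of GAE. The function name `gae_advantages` and the toy reward/value lists are illustrative assumptions, not code from the paper.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Compute GAE advantages with a backward pass.

    `values` must have len(rewards) + 1 entries: one value estimate per
    state, plus a bootstrap value for the state after the last action.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual delta_t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + (gamma * lambda) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Terminal-only reward and an uninformative (all-zero) value model.
print(gae_advantages([0, 0, 1], [0.0, 0.0, 0.0, 0.0], lam=0.5))  # [0.25, 0.5, 1.0]
```

With a terminal-only reward and an uninformative value model, the advantage reaching a token `l` steps before the end is scaled by `(gamma * lam) ** l`, which is one way to see why the choice of `lam` matters so much for very long responses.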

4. KL Divergence Penalty

In RLHF, the policy is constrained not to drift too far from a reference policy $\pi_\text{ref}$ (usually the SFT model). This prevents “reward hacking” where the model finds degenerate solutions that score well on the reward model but are nonsensical to humans. The KL penalty is:

D_\text{KL}(\pi_\theta \| \pi_\text{ref}) = \mathbb{E}_{o \sim \pi_\theta} \left[ \log \frac{\pi_\theta(o|q)}{\pi_\text{ref}(o|q)} \right]

A per-sample unbiased estimator (Schulman, 2020) that needs samples only from $\pi_\theta$, never from $\pi_\text{ref}$:

D_\text{KL}^{\text{est}}(\pi_\theta \| \pi_\text{ref}) = \frac{\pi_\text{ref}(o_i|q)}{\pi_\theta(o_i|q)} - \log \frac{\pi_\text{ref}(o_i|q)}{\pi_\theta(o_i|q)} - 1

This estimator is non-negative and equals zero only when $\pi_\theta = \pi_\text{ref}$ , making it suitable as a loss term.
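Both properties can be checked exactly on a small discrete distribution. The sketch below is only an illustration: the three-outcome distributions and the helper names `kl_k3` and `kl_naive` are assumptions made for this example.

```python
import math


def kl_k3(logp_theta, logp_ref):
    """Per-sample estimate: ratio - log(ratio) - 1, with ratio = pi_ref / pi_theta."""
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0


def kl_naive(logp_theta, logp_ref):
    """Per-sample log-ratio; unbiased too, but individual samples can be negative."""
    return logp_theta - logp_ref


# Toy distributions over three outcomes (illustrative values).
pi_theta = [0.5, 0.3, 0.2]
pi_ref = [0.4, 0.4, 0.2]

true_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))
k3_samples = [kl_k3(math.log(p), math.log(q)) for p, q in zip(pi_theta, pi_ref)]
naive_samples = [kl_naive(math.log(p), math.log(q)) for p, q in zip(pi_theta, pi_ref)]

# Exact expectation under pi_theta, weighting each outcome by its probability.
expected_k3 = sum(p * s for p, s in zip(pi_theta, k3_samples))

print(abs(expected_k3 - true_kl) < 1e-12)  # True: same expectation as the true KL
print(min(k3_samples) >= 0.0)              # True: never negative per sample
print(min(naive_samples) < 0.0)            # True: the plain log-ratio can go negative
```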

The Core Contribution

DeepSeek-R1 makes two main claims:

Emergence without annotation (R1-Zero): A large LM can develop sophisticated multi-step reasoning patterns — self-reflection, verification, exploring alternatives — purely through RL with outcome-based rewards. No SFT, no human-written reasoning traces. The reasoning behaviors emerge from the optimization process itself.
Practical frontier reasoning (R1): A 4-stage pipeline that combines cold-start SFT, two stages of RL, and rejection-sampling SFT can produce a model competitive with OpenAI o1 while being fully open-source.

Part I: GRPO — Group Relative Policy Optimization

Why a New Algorithm?

PPO works but is expensive for long-CoT training for three reasons:

The value model doubles GPU memory.
The value model must predict expected future reward from partial sequences — but for a long reasoning chain where the model might revise earlier steps later, this prediction is extremely noisy.
PPO’s KL penalty enters as a per-token reward, which implicitly penalizes sequence length — bad for training models that should reason longer.

GRPO addresses the first two by eliminating the value model entirely, and the third by moving the KL term out of the per-token reward and into the loss (Eq. 1).

The GRPO Objective

Figure 1: GRPO vs PPO Architecture

PPO:
  q → Policy → o → Reward Model → r
               ↕
          Value Model → v → GAE → Advantage Â

GRPO:
  q → Policy → {o₁, o₂, …, oG}
              ↓
       Reward Model → {r₁, r₂, …, rG}
              ↓
       Group statistics: mean(r), std(r)
              ↓
       Aᵢ = (rᵢ - mean) / std

For each question $q$ , GRPO samples a group of $G$ outputs $\{o_1, o_2, \ldots, o_G\}$ from the current policy $\pi_{\theta_\text{old}}$ . It then optimizes:

\mathcal{J}_\text{GRPO}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^G \left( \min\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_\text{old}}(o_i|q)} A_i,\ \text{clip}\left(\frac{\pi_\theta(o_i|q)}{\pi_{\theta_\text{old}}(o_i|q)}, 1-\varepsilon, 1+\varepsilon\right) A_i \right) - \beta\, D_\text{KL}(\pi_\theta \| \pi_\text{ref}) \right) \right] \tag{Eq. 1}

The advantage is computed directly from group scores:

A_i = \frac{r_i - \text{mean}(\{r_1, \ldots, r_G\})}{\text{std}(\{r_1, \ldots, r_G\})} \tag{Eq. 2}

The KL term uses the unbiased estimator (Eq. 3):

D_\text{KL}(\pi_\theta \| \pi_\text{ref}) = \frac{\pi_\text{ref}(o_i|q)}{\pi_\theta(o_i|q)} - \log \frac{\pi_\text{ref}(o_i|q)}{\pi_\theta(o_i|q)} - 1 \tag{Eq. 3}
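Eqs. 1 to 3 fit in a few lines of code. The following is a minimal sketch under stated assumptions: it works on sequence-level log-probabilities, as the equations are written here, rather than per token. It uses the population standard deviation, which the review does not specify. The function name `grpo_objective` is hypothetical.

```python
import math
from statistics import mean, pstdev


def grpo_objective(logp_new, logp_old, logp_ref, rewards, eps=10.0, beta=0.001):
    """Evaluate the GRPO objective (Eq. 1) for one group of sampled outputs."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        raise ValueError("all rewards equal: the group carries no learning signal")
    total = 0.0
    for lp_new, lp_old, lp_ref, r in zip(logp_new, logp_old, logp_ref, rewards):
        advantage = (r - mu) / sigma                          # Eq. 2
        ratio = math.exp(lp_new - lp_old)                     # pi_theta / pi_theta_old
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        surrogate = min(ratio * advantage, clipped * advantage)
        log_ref_ratio = lp_ref - lp_new                       # log(pi_ref / pi_theta)
        kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0    # Eq. 3
        total += surrogate - beta * kl
    return total / len(rewards)


# Two sampled outputs: the new policy raised the log-prob of the rewarded one.
value = grpo_objective(
    logp_new=[-1.0, -2.0],
    logp_old=[-1.5, -2.0],
    logp_ref=[-1.5, -2.0],
    rewards=[1.0, 0.0],
)
print(value > 0)  # True
```

When the new, old, and reference policies coincide, every ratio is 1 and every KL term is 0, so the objective reduces to the mean of the normalized advantages, which is zero.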

Step-by-Step GRPO Forward Pass

Let me trace through one training step explicitly.

Step 1 — Sample group. For question $q$ , sample $G = 16$ responses $\{o_1, \ldots, o_{16}\}$ from $\pi_{\theta_\text{old}}$ with temperature 1. Each $o_i$ is a complete sequence up to 32,768 tokens (64K after step 8,200).

Step 2 — Score. Pass each $o_i$ through the reward function (rule-based for math/code: check final answer correctness + format compliance). Obtain $\{r_1, \ldots, r_{16}\}$ .

Step 3 — Normalize. Compute $A_i = (r_i - \bar{r}) / \sigma_r$ where $\bar{r}$ and $\sigma_r$ are the mean and standard deviation within this group. This is analogous to whitened advantages in PPO but requires no learned value function.

Step 4 — Gradient. Compute the GRPO loss (Eq. 1) using the policy ratios. The clip $[1-\varepsilon, 1+\varepsilon]$ with $\varepsilon = 10$ (! — much larger than PPO’s usual 0.2) limits how much the policy update can move. The large clip is a deliberate choice (more on this below).

Step 5 — KL regularization. The KL term (Eq. 3) is added to the loss with coefficient $\beta = 0.001$ . The reference policy is re-synced to the current policy every 400 steps.

Step 6 — Update. Standard gradient descent on $\pi_\theta$ , keeping $\pi_{\theta_\text{old}}$ frozen. The 8,192 outputs generated per rollout are split into 16 mini-batches and trained for a single inner epoch.
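The batch arithmetic in Step 6 is easy to lose track of, so here is a small bookkeeping sketch. The constants are the numbers quoted above. The helper `split_rollout` and the fixed seed are assumptions made for illustration.

```python
import random

OUTPUTS_PER_ROLLOUT = 8_192
GROUP_SIZE = 16
NUM_MINIBATCHES = 16


def split_rollout(num_outputs, num_minibatches, seed=0):
    """Shuffle output indices and cut them into equally sized mini-batches."""
    indices = list(range(num_outputs))
    random.Random(seed).shuffle(indices)
    size = num_outputs // num_minibatches
    return [indices[i * size:(i + 1) * size] for i in range(num_minibatches)]


questions_per_rollout = OUTPUTS_PER_ROLLOUT // GROUP_SIZE
minibatches = split_rollout(OUTPUTS_PER_ROLLOUT, NUM_MINIBATCHES)

print(questions_per_rollout)  # 512 questions, each with 16 sampled outputs
print(len(minibatches[0]))    # 512 outputs per mini-batch
```

If mini-batches are formed by shuffling individual outputs like this, the group-normalized advantages have to be computed before the split, while each group of 16 is still together.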

Why Does Normalizing Within the Group Work?

The group normalization transforms rewards into a zero-mean, unit-variance signal. This is important because:

Rewards on different problems have wildly different scales (a problem that always gives 0 or 1 vs. a partial-credit rubric).
Without normalization, the policy would update more aggressively on problems where rewards happen to be large-valued, leading to uneven learning.
The group baseline (mean) serves the same role as the value function baseline: subtracting it reduces variance without introducing bias.

The key insight: you don’t need to predict the baseline from a separate model. If you sample multiple outputs for the same question, the empirical mean is a very good baseline with zero additional parameters.
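A short sketch of the normalization shows the scale invariance directly. The helper `group_advantages` is a hypothetical name. It assumes the population standard deviation, and it assumes that a zero-variance group is mapped to all-zero advantages.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Normalize rewards within one group: (r - mean) / std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        # Every sample scored the same, so there is nothing to prefer.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


# 4 of 16 samples correct, scored on a 0/1 scale...
binary = [1.0] * 4 + [0.0] * 12
# ...and the same outcomes scored on a 0/10 scale.
scaled = [10.0 * r for r in binary]

adv_binary = group_advantages(binary)
adv_scaled = group_advantages(scaled)

print(round(adv_binary[0], 3), round(adv_binary[-1], 3))  # 1.732 -0.577
print(all(abs(a - b) < 1e-9 for a, b in zip(adv_binary, adv_scaled)))  # True
```

Rarer successes receive larger positive advantages: with 4 correct out of 16, each correct sample gets about +1.73 while each incorrect one gets about -0.58.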

The Large Clip Ratio (ε = 10)

Standard PPO uses $\varepsilon \approx 0.2$ , meaning the policy can only change the token probability ratio by ±20% per update before the gradient is clipped. DeepSeek-R1 sets $\varepsilon = 10$ — this looks enormous, and it is.

Why? For long reasoning chains, many tokens in a correct response are “mundane” tokens that carry little information about why the response was correct or incorrect. A tight clip prevents the policy from moving these tokens at all, wasting the learning signal. With $\varepsilon = 10$ , the policy can make larger updates, allowing the gradient signal from the outcome reward to propagate back effectively across the 10K–30K token reasoning trace.

The risk: instability. The authors validate that the large clip doesn’t cause training instability in practice, likely because the KL regularization, together with the fact that most token ratios stay close to 1, provides implicit stability.
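One way to see what $\varepsilon = 10$ changes is to count when the clipped branch of the $\min$ is selected, since that is when the gradient through the ratio vanishes. The sketch below uses made-up ratios, and the helper name `clipped_branch_active` is an assumption.

```python
def clipped_branch_active(ratio, advantage, eps):
    """True when min(r*A, clip(r)*A) picks the clipped term (no gradient via r)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return clipped * advantage < ratio * advantage


ratios = [0.5, 0.9, 1.0, 1.3, 3.0, 12.0]  # illustrative probability ratios

for eps in (0.2, 10.0):
    positive = sum(clipped_branch_active(r, +1.0, eps) for r in ratios)
    negative = sum(clipped_branch_active(r, -1.0, eps) for r in ratios)
    print(eps, positive, negative)
# 0.2 3 1
# 10.0 1 0
```

Note that with $\varepsilon = 10$ the lower bound $1 - \varepsilon = -9$ can never bind, because a probability ratio is always positive. Only the upper bound of 11 remains as a guard against extreme updates.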

Part II: DeepSeek-R1-Zero — Pure RL Without SFT

Setup

Base model: DeepSeek-V3-Base (671B MoE, 37B active parameters)
No SFT: Training starts directly from the base checkpoint
Reward: Rule-based only. For math: is the final answer correct? For code: do test cases pass? For format: is the answer wrapped in <think>...</think><answer>...</answer> tags?

R_\text{rule} = R_\text{acc} + R_\text{format} \tag{Eq. 4}

The template is deliberately minimal:

User: {problem}
Assistant: <think> {reasoning process} </think> <answer> {answer} </answer>

No guidance on how to reason — only the structural format.
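A rule-based reward of this kind can be sketched in a few lines. This is an illustration only: the reward magnitudes (1.0 each), the exact-string answer match, and the names `TEMPLATE` and `rule_reward` are assumptions, since the review does not give the actual weights or matching rules.

```python
import re

# The response must be exactly one think block followed by one answer block.
TEMPLATE = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def rule_reward(response, reference_answer):
    """R_rule = R_acc + R_format (Eq. 4), with assumed unit weights."""
    match = TEMPLATE.match(response)
    format_reward = 1.0 if match else 0.0
    accuracy_reward = 0.0
    if match and match.group(2).strip() == reference_answer.strip():
        accuracy_reward = 1.0
    return accuracy_reward + format_reward


good = "<think>2 + 2 is 4.</think> <answer>4</answer>"
wrong = "<think>2 + 2 is 5.</think> <answer>5</answer>"
unformatted = "The answer is 4."

print(rule_reward(good, "4"), rule_reward(wrong, "4"), rule_reward(unformatted, "4"))
# 2.0 1.0 0.0
```

A production checker for math would need to normalize equivalent answers (fractions, units, LaTeX formatting) before comparing. Exact string match is the simplest stand-in.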

Training Dynamics

Figure 2: Training Trajectory of DeepSeek-R1-Zero

AIME 2024 accuracy                    Average response length
  1.0 |              ....r1-zero-cons@16|  20K |                     .....
  0.8 |           ...                  |  15K |                  ....
  0.6 |        ...                     |  10K |              ....
  0.4 |     ...             human      |   5K |          ....
  0.2 |  ...                baseline   |      |    .......
  0.0 |...________________________      |    0 |...__________________
      0      5K step     10K             0      5K step     10K

Two striking observations:

Accuracy climbs steadily from 15.6% to 77.9% pass@1 on AIME 2024 (and to 86.7% with majority voting over 16 samples), surpassing the average human competitor score.
Response length grows organically from ~3,000 tokens to ~17,000 tokens per response. The model is “buying” more thinking time autonomously — the RL objective never explicitly rewards length.

The “Aha Moment”

Around training step 5,000, the model begins using the word “wait” as a self-correction signal within its <think> block:

<think>
... [initial approach] ...

Wait, wait. Wait. That's an aha moment I can flag here.
Let me reevaluate this step-by-step...

[revised approach]
</think>
<answer> ... </answer>

This wasn’t taught. The model discovered that pausing and re-examining its work leads to higher rewards, and converged on a verbal marker for this. It is a genuine emergent capability — not imitation of human-written reasoning traces, but discovered via RL.

The frequency of “wait” in reflective contexts (tracked over training steps) shows a sharp phase transition around step 4,000–5,000, which coincides with the jump in AIME accuracy.

Why Does SFT Hurt?

This is the paper’s most provocative theoretical claim. The argument:

In SFT, the model is trained to reproduce human reasoning traces.
Humans have biases: they tend to write reasoning in specific ways, at specific lengths, with specific vocabulary.
This “constrains the exploration space” of the policy. The model learns to reason in human ways, capped by human quality.
In pure RL, the model can discover non-human reasoning strategies that are better optimized for the verifiable reward.

The alternative and its failure mode: Why not just do SFT-then-RL? The paper shows this works (DeepSeek-R1 uses exactly this), but the SFT-initialized policy is less free to explore novel patterns. R1-Zero explores more, yet it suffers from language mixing and poor readability; R1 adds the cold-start SFT step specifically to address those issues.

Part III: DeepSeek-R1 — The Full Multi-Stage Pipeline

Pipeline Overview

Figure 3: DeepSeek-R1 Four-Stage Training Pipeline

graph LR
    A[DeepSeek-V3-Base] --> B[Stage 1: Cold-Start SFT]
    B --> C[DeepSeek-R1-Dev1]
    C --> D[Stage 2: RL Stage 1\nreasoning-only rewards]
    D --> E[DeepSeek-R1-Dev2]
    E --> F[Stage 3: Rejection Sampling\n+ SFT on mixed data]
    F --> G[DeepSeek-R1-Dev3]
    G --> H[Stage 4: RL Stage 2\ndiversity + preference rewards]
    H --> I[DeepSeek-R1]

Stage 1: Cold-Start SFT

The problem with pure RL from base: responses can mix Chinese and English mid-thought, be poorly formatted, and have low readability even when correct. Cold-start SFT addresses this.

Data: “Thousands of” examples (a deliberately small dataset) of conversational, human-aligned long CoT reasoning. Curated to exhibit:

Natural thinking process (not just final answers)
Language consistency
Proper use of <think> tags
Summary section after thinking

Effect: Dev1 vs R1-Zero shows big jumps in IF-Eval (instruction following) and Arena-Hard — the model learns to communicate better — but a dip in pure math performance (less free exploration). The cold start anchors the model in human communication patterns at the cost of some RL freedom.

Stage 2: First RL Stage (Reasoning Focus)

Same GRPO setup as R1-Zero, but:

Initialized from Dev1 (not raw base)
Additional reward: Language consistency reward (Eq. 5):

R_\text{language} = \frac{\text{Num}(\text{Words}_\text{target})}{\text{Num}(\text{Words})} \tag{Eq. 5}

This penalizes mixing Chinese and English within the CoT. It’s added directly to the final reward: $R = R_\text{rule} + R_\text{language}$ .

The ablation in the supplementary shows this trades ~1–2 points of reasoning accuracy for significantly better readability. The authors accept this tradeoff.
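Eq. 5 is a simple ratio, and a rough version can be written down directly. The word segmentation below is a crude assumption for illustration (each run of Latin letters counts as one word, each CJK character counts as one word). It is not the paper's tokenization, and the function name is hypothetical.

```python
import re

# Crude segmentation: Latin-letter runs, or single CJK ideographs.
WORD_PATTERN = re.compile(r"[A-Za-z]+|[\u4e00-\u9fff]")


def language_consistency_reward(cot, target="en"):
    """Fraction of words in the chain of thought that are in the target language."""
    words = WORD_PATTERN.findall(cot)
    if not words:
        return 0.0
    is_english = [w[0].isascii() for w in words]
    in_target = sum(is_english) if target == "en" else len(words) - sum(is_english)
    return in_target / len(words)


print(language_consistency_reward("Let x be the side length"))   # 1.0
print(language_consistency_reward("We compute 面积 first"))       # 0.6
```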

Training configuration (RL Stage 1, i.e. pipeline Stage 2):

LR: 3e-6
KL coefficient β: 0.001
Clip ratio ε: 10
Group size G: 16
Max sequence length: 32,768 (→ 65,536 after step 8,200)
Batch: 32 unique questions × 16 outputs = 512 per step
Reference model refreshed: every 400 steps
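For anyone wiring these values into a training script, a small config object keeps the derived quantities consistent. This is a sketch: the class and field names are hypothetical, and the exact boundary behaviour at step 8,200 (switching on the step after it) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRPOConfig:
    learning_rate: float = 3e-6
    kl_coef: float = 0.001          # beta
    clip_eps: float = 10.0          # epsilon
    group_size: int = 16            # G
    questions_per_step: int = 32
    max_len_initial: int = 32_768
    max_len_extended: int = 65_536
    extend_after_step: int = 8_200
    ref_refresh_every: int = 400

    @property
    def batch_size(self):
        """Number of sampled outputs per training step."""
        return self.questions_per_step * self.group_size

    def max_len(self, step):
        """Maximum sequence length in effect at a given step."""
        if step > self.extend_after_step:
            return self.max_len_extended
        return self.max_len_initial

    def refresh_reference(self, step):
        """True on steps where the reference model is re-synced to the policy."""
        return step > 0 and step % self.ref_refresh_every == 0


cfg = GRPOConfig()
print(cfg.batch_size)            # 512
print(cfg.max_len(9_000))        # 65536
print(cfg.refresh_reference(800))  # True
```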

Stage 3: Rejection Sampling + SFT

After RL Stage 1, the model (Dev2) can produce high-quality reasoning chains. Now sample from Dev2 and filter:

Generate: For each prompt in the training set, sample multiple responses.
Filter: Keep only responses where the final answer is verifiable and correct.
SFT: Fine-tune on the filtered (correct) responses, both reasoning and non-reasoning data.
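The generate-and-filter steps above can be sketched as a small loop. The callables `generate` and `is_correct` are hypothetical stand-ins for the Dev2 sampler and the answer checker, and `samples_per_prompt=4` is an arbitrary illustrative value.

```python
def rejection_sample(prompts, generate, is_correct, samples_per_prompt=4):
    """Sample several responses per prompt and keep only the verified-correct ones."""
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            response = generate(prompt)
            if is_correct(prompt, response):
                kept.append((prompt, response))
    return kept


# Toy stand-ins: a "model" that replays canned answers and an exact-match checker.
canned = iter(["4", "5", "4", "7"])
sft_pairs = rejection_sample(
    prompts=["2+2"],
    generate=lambda prompt: next(canned),
    is_correct=lambda prompt, response: response == "4",
)
print(sft_pairs)  # [('2+2', '4'), ('2+2', '4')]
```

A real pipeline would likely also deduplicate and apply readability filters before SFT. This sketch keeps only the correctness filter described in the list.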

The non-reasoning data is critical: it teaches writing, question-answering, factual recall, and code engineering — tasks where rule-based verification is impossible.

Combined SFT dataset (Dev3):

High-quality reasoning traces: selected via rejection sampling from Dev2
Non-reasoning data: re-used from DeepSeek-V3’s SFT pipeline
Code engineering data: for Aider-Polyglot performance

Effect on Dev3 vs Dev2: +7 points on AlpacaEval 2.0, +19 points on Aider-Polyglot. General intelligence improves significantly; math/code is mostly preserved.

Stage 4: Second RL Stage (Diversity + Preference)

Final RL stage on Dev3. Two key changes from Stage 2:

Diverse data: Mix reasoning prompts with general instruction prompts.
Mixed rewards: Rule-based reward for reasoning; reward model for general data.

R = R_\text{reasoning} + R_\text{general} + R_\text{language} \tag{Eq. 6}

R_\text{general} = R_\text{reward\_model} + R_\text{format} \tag{Eq. 7}

The reward model itself is trained separately:

Helpful RM: 66,000 preference pairs. DeepSeek-V3 is prompted to generate two candidate responses for each query. Each pair is scored four times, with the A/B order randomized to reduce positional bias. Pairs with score difference $\Delta < 1$ are discarded for quality. The RM architecture is DeepSeek-R1 with a scalar reward head.

R_\text{helpful} = \text{RM}_\text{helpful}(\text{Response}_A, \text{Response}_B) \tag{Eq. 8}
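The pair-construction procedure (score four times with randomized order, drop pairs with a gap below 1) might look like the sketch below. Averaging the four scores, the `judge` callable that returns one score per slot, and all names here are assumptions for illustration. The review does not specify how the four scores are combined.

```python
import random
from statistics import mean


def build_preference_pair(query, resp_1, resp_2, judge, rng, n_rounds=4, min_gap=1.0):
    """Return (chosen, rejected), or None when the averaged score gap is below min_gap."""
    scores_1, scores_2 = [], []
    for _ in range(n_rounds):
        if rng.random() < 0.5:
            # resp_1 is shown in slot A, resp_2 in slot B.
            s_1, s_2 = judge(query, resp_1, resp_2)
        else:
            # Swapped presentation order: resp_2 in slot A, resp_1 in slot B.
            s_2, s_1 = judge(query, resp_2, resp_1)
        scores_1.append(s_1)
        scores_2.append(s_2)
    gap = mean(scores_1) - mean(scores_2)
    if abs(gap) < min_gap:
        return None  # too close to call: discard the pair
    return (resp_1, resp_2) if gap > 0 else (resp_2, resp_1)


def toy_judge(query, resp_a, resp_b):
    """Hypothetical judge that simply scores by response length."""
    return float(len(resp_a)), float(len(resp_b))


rng = random.Random(0)
print(build_preference_pair("q", "a long detailed answer", "short", toy_judge, rng))
# ('a long detailed answer', 'short')
print(build_preference_pair("q", "abcd", "abce", toy_judge, rng))
# None
```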

Safety RM: 106,000 prompts with binary safe/unsafe labels. Pointwise classification (unlike pairwise helpful RM). The safety RM evaluates the entire response including the reasoning trace.

R_\text{safety} = \text{RM}_\text{safety}(\text{Response}) \tag{Eq. 9}

Stage 4 configuration:

LR: same as Stage 2
Temperature: 0.7 (reduced from 1.0 — higher temperatures cause incoherent generation at this stage)
Steps: 1,700 total; preference-based rewards added only in last 400 steps
Observation: more steps with model-based preference rewards → reward hacking; capped at 400 steps to prevent this

Why reduce temperature? At this stage the model already has strong priors from Stage 3. High temperature leads to incoherent responses, not creative exploration. Exploration is no longer needed — exploitation and alignment are the goals.

Part IV: Distillation to Smaller Models

Method

The stronger reasoning capabilities of DeepSeek-R1 can be transferred to smaller models through knowledge distillation, though not in the traditional sense of matching the teacher’s output distribution or intermediate representations. Instead, they use SFT on model-generated data:

Figure 4: Distillation Pipeline

DeepSeek-R1 (671B)
       |
       | Generate 800K long-CoT reasoning traces
       | (mathematical, code, science problems)
       ↓
 Filter: keep only correct answers
       ↓
 SFT on filtered data
       ↓
 DeepSeek-R1-Distill-{Qwen, Llama}-{1.5B, 7B, 8B, 14B, 32B, 70B}

The data: 600K reasoning problems + 200K non-reasoning problems, generating trajectories from R1.

Pseudocode: Distillation Training Loop

for each base model M in {Qwen-1.5B, Qwen-7B, Qwen-14B, Qwen-32B, Llama-8B, Llama-70B}:
    initialize policy π = M
    for each mini-batch of (q, o*) in D_distill, where o* is R1's accepted reasoning trace:
        loss = CrossEntropy(π(· | q, o*_<t), o*_t), averaged over the tokens of o*   // SFT loss
        update π by one gradient-descent step on loss
    evaluate π on AIME, MATH-500, LiveCodeBench
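The SFT loss in the pseudocode is ordinary next-token cross-entropy on the teacher's trace. A minimal sketch follows, assuming the per-token log-probabilities have already been computed by the student. Masking out prompt tokens is common practice and an assumption here, not something the review states.

```python
import math


def sft_loss(token_logprobs, is_response_token):
    """Mean negative log-likelihood over response tokens only."""
    losses = [-lp for lp, keep in zip(token_logprobs, is_response_token) if keep]
    if not losses:
        raise ValueError("no response tokens to train on")
    return sum(losses) / len(losses)


# One prompt token followed by two response tokens (illustrative probabilities).
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, True, True]

loss = sft_loss(logprobs, mask)
print(round(loss, 4))  # 1.0397
```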

Key finding: Direct SFT on R1-generated traces substantially outperforms applying RL directly to small models. The paper also shows that distilling from DeepSeek-R1 outperforms distilling from DeepSeek-V3 for reasoning tasks — confirming that R1’s traces encode better reasoning patterns.

Distillation Results

Figure 5: Distilled Model Performance vs Size

Model	AIME 2024	MATH-500	LiveCodeBench
R1-Distill-Qwen-1.5B	28.9%	83.9%	34.1%
R1-Distill-Qwen-7B	55.5%	92.8%	54.9%
R1-Distill-Qwen-14B	69.7%	93.9%	64.7%
R1-Distill-Qwen-32B	72.6%	94.3%	69.4%
R1-Distill-Llama-8B	50.4%	89.1%	48.9%
R1-Distill-Llama-70B	70.0%	94.5%	65.7%
DeepSeek-R1 (671B)	79.8%	97.3%	65.9%
OpenAI-o1-mini	63.6%	90.0%	53.8%
OpenAI-o1	74.3%	96.4%	63.4%

The 1.5B model already matches QwQ-32B (a specialized 32B reasoning model) on several tasks. The 70B model is competitive with OpenAI o1 at a small fraction of the parameter count.

Why Distillation Works Better Than Small-Model RL

Directly training a 7B model with GRPO from scratch produces much worse results than SFT on R1’s traces. Why?

Capacity: Small models may lack the parameter budget to discover novel reasoning strategies from scratch. Large models have already done the exploration.
Training signal density: A 7B model solving AIME problems will succeed very rarely early in training, giving almost no positive reward signal for RL to work with. R1’s traces provide dense supervision.
Long CoT requirements: Generating coherent 10,000-token reasoning chains requires a certain base capability. Small models can’t do this reliably without guidance.

Part V: Experimental Results and Analysis

Main Benchmarks

Figure 6: DeepSeek-R1 vs State-of-the-Art (select benchmarks)

Benchmark	R1-Zero	R1	o1-0912	o1-mini	GPT-4o
AIME 2024 (Pass@1)	77.9	79.8	74.3	63.6	9.3
MATH-500 (Pass@1)	95.9	97.3	96.4	90.0	76.6
GPQA Diamond	75.8	71.5	77.3	60.0	53.6
LiveCodeBench	50.0	65.9	63.4	53.8	33.4
Codeforces (Rating)	1444	2029	1891	1820	759
MMLU	88.8	90.8	92.3	85.2	87.2
AlpacaEval2 (LC)	24.7	87.6	—	—	57.5

Key observations:

R1-Zero is already impressive but suffers on general tasks (AlpacaEval 24.7%).
R1’s multi-stage pipeline fixes general task performance (87.6% AlpacaEval) without losing math/code.
R1 on Codeforces achieves ELO 2029 — top 3% of human competitors.

Ablation: Why Not Skip the Cold Start?

The paper provides an implicit ablation: compare R1-Zero (no cold start, pure RL) with R1-Dev1 (cold start SFT, then RL). The results:

Dev1 vs R1-Zero on IF-Eval: 71.7% vs 46.6% — huge win for cold start on instruction following.
Dev1 vs R1-Zero on AIME 2024: 59.0% vs 77.9% — R1-Zero wins on pure math.

Interpretation: cold start SFT trades exploration freedom (slightly worse math) for communication quality (much better instruction following). The subsequent RL stages in R1 recover the math performance.

Ablation: Does the Language Consistency Reward Help?

Supplementary B.6 shows: removing $R_\text{language}$ gives +1-2 points on math benchmarks but produces significantly mixed-language outputs. The tradeoff is explicitly acknowledged and the authors accept the performance cost for usability.

Test-Time Compute Scaling

A key advantage of chain-of-thought reasoning models: you can trade inference compute for accuracy. DeepSeek-R1 supports:

Majority voting (cons@N): Generate N responses, take the majority answer. Scaling from N=1 to N=16 improves AIME 2024 from 79.8% to 87.2%.
Dynamic length allocation: Unlike MCTS or beam search, R1 naturally allocates more tokens to harder problems within a single generation, without external compute allocation.
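Majority voting is simple enough to show in full. The sketch below assumes answers have already been extracted and normalized to comparable strings. The function names are illustrative.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer (ties go to the one seen first)."""
    return Counter(answers).most_common(1)[0][0]


def cons_at_n(samples_per_problem, references):
    """Fraction of problems whose majority answer matches the reference."""
    hits = sum(
        majority_vote(samples) == reference
        for samples, reference in zip(samples_per_problem, references)
    )
    return hits / len(references)


samples = [
    ["42", "17", "42", "41"],  # majority "42", correct
    ["3", "4", "4", "4"],      # majority "4", incorrect
]
print(cons_at_n(samples, ["42", "3"]))  # 0.5
```

Voting helps exactly when the correct answer is the single most common one, even if it appears in fewer than half of the samples.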

Part VI: Design Choices, Alternatives, and Boundaries

Choice 1: Rule-Based Reward vs Neural Reward for Reasoning

What they did: For math, code, and logic problems, they use rule-based rewards (check the answer, compile the code). No neural reward model for these tasks.

Why: Neural reward models are susceptible to reward hacking during large-scale RL. As the policy diverges from the SFT model, it finds inputs that fool the reward model without being genuinely correct. For math, the ground-truth answer is unambiguous, so rule-based verification leaves far less room for exploitation than a learned scorer.

What would happen if you used a neural reward? The policy would eventually learn to produce responses that score well on the neural RM but may not actually be correct. This is documented in Supplementary B.5 for the Stage 4 RM: more than 400 steps of preference RM training leads to reward hacking.

Boundary condition: This only works because math/code has verifiable answers. For writing, instruction following, open-ended QA — where there’s no ground truth — neural reward models are unavoidable. This is why Stage 4 uses neural RMs only for general data, limited to 400 steps.

Choice 2: GRPO Instead of PPO

What they did: Eliminate the value model, compute advantages from group statistics.

Why: Memory/compute savings are significant at 671B scale — avoiding a 671B value model. Also, for long reasoning chains, the value function is hard to learn accurately.

What would PPO do? In the paper’s PPO-vs-GRPO comparison (cited here as the paper’s Figure 4, not this review’s Figure 4), PPO with default $\lambda=0.95$ performs significantly worse than GRPO. With careful tuning ($\lambda=1.0$), PPO matches GRPO but requires extra hyperparameter search. At scale, the memory overhead of a value model makes PPO impractical.

Boundary: GRPO requires that multiple samples from the same question are meaningful — that rewards are comparable across samples from the same distribution. For tasks where all outputs reliably score near 0 or 1 (very easy or very hard problems), the group variance is near 0, providing no gradient signal. Problem selection (challenging but not impossible) matters enormously.
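The zero-variance case is easy to detect from the rewards alone. The sketch below shows one possible way to drop uninformative groups before the update. It is an illustration of the boundary condition, not a procedure taken from the paper, and the names are hypothetical.

```python
def has_learning_signal(rewards):
    """A group is informative only if its samples did not all score the same."""
    return max(rewards) > min(rewards)


def keep_informative_groups(groups):
    """Filter a {question: rewards} mapping down to groups with non-zero variance."""
    return {q: r for q, r in groups.items() if has_learning_signal(r)}


groups = {
    "too_easy": [1.0, 1.0, 1.0, 1.0],   # always solved: every advantage is 0
    "too_hard": [0.0, 0.0, 0.0, 0.0],   # never solved: every advantage is 0
    "useful": [1.0, 0.0, 0.0, 1.0],     # mixed outcomes: non-zero advantages
}
print(list(keep_informative_groups(groups)))  # ['useful']
```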

Choice 3: Multi-Stage Pipeline Instead of End-to-End RL

What they did: Four distinct training stages with different data, rewards, and objectives.

Why: End-to-end RL from base would likely fail because (a) language mixing would make outputs unreadable, (b) the policy would over-specialize in reasoning domains and neglect general tasks, and (c) the base model needs some anchoring in human communication styles before RL can effectively explore.

Alternative: OpenAI o1 reportedly uses a similar staged approach but details are not public. Pure end-to-end RL (R1-Zero) achieves strong reasoning but poor general performance.

Boundary: The multi-stage pipeline is more complex and each stage introduces hyperparameters. The authors mention reward hacking concerns at Stage 4 — this requires careful monitoring. This pipeline is not easily reproducible without significant infrastructure.

Choice 4: Distillation via SFT Rather Than RL Transfer

What they did: Generate 800K traces from R1, filter for correctness, SFT smaller models on these traces.

Why: Direct RL on small models fails to learn because the reward signal is too sparse (small models rarely get hard problems right initially). SFT on correct traces provides dense supervision.

What if they did RL on smaller models instead? The paper shows in Supplementary F that applying GRPO directly to a 7B model with math data produces much weaker results than SFT on R1’s traces. The key insight: capability transfer is more efficient than independent discovery.

Boundary: The distilled models inherit R1’s reasoning style, including its verbosity. For latency-sensitive applications, these models may be too slow. The distilled models also can’t improve beyond what R1 demonstrates.

Part VII: Limitations and Open Problems

The paper is admirably honest about limitations:

Structured output and tool use: R1 cannot yet call external tools reliably during reasoning. This is a major gap compared to agentic systems.
Token efficiency / “overthinking”: R1 sometimes uses many more tokens than necessary even for simple problems. The length growth is driven by the RL objective (correct answers get full reward regardless of length), not by genuine problem complexity.
Language mixing: The model is optimized for Chinese and English. Other languages trigger mixing issues.
Prompt sensitivity: Few-shot prompting degrades performance. R1 is designed for zero-shot use.
Software engineering tasks: Long evaluation times (running code test suites) made it impractical to apply RL extensively to software engineering. This is a systems problem, not a modeling problem.
Reward hacking: The pipeline requires careful stage-wise management. The preference RM at Stage 4 can be hacked if trained too long. Rule-based rewards break down for open-ended tasks.
Safety: The model’s safety level is “moderate” compared to GPT-4o without a risk control layer. Enhanced reasoning capability can make unsafe responses more operationally feasible.

Part VIII: Infrastructure and Systems Perspective

This is briefly covered in Supplementary B.1. The RL infrastructure uses:

vLLM for rollout generation (efficient batched inference for the policy model’s sample generation)
Overlapped execution: Rollout (inference), reward computation (code execution, answer matching, format checking), and training are pipelined to avoid idle GPU time.
Multi-node distributed training: The 671B model is distributed across many nodes with tensor parallelism and pipeline parallelism.
Reference model management: The reference model is updated every 400 steps, not kept frozen for the entire run. This allows the KL constraint to adapt as the policy improves, rather than constraining it to an increasingly irrelevant reference.

Figure 7: RL Training Infrastructure Dataflow

graph TD
    A[vLLM Workers\nRollout Generation] --> B[Reward Computation\nAnswer Matcher / Code Executor / Format Checker]
    B --> C[Advantage Computation\nGroup Normalize rᵢ → Aᵢ]
    C --> D[Actor Model Training\nGRPO Loss + KL Term]
    D -->|Every 400 steps: sync reference| E[Reference Model Update]
    A -->|Pack data for training| D

The key insight from the infrastructure: rollout is the bottleneck, not training. vLLM is chosen precisely because it can generate 8,192 outputs per rollout efficiently using PagedAttention.

Part IX: Why This Paper Matters

DeepSeek-R1’s significance goes beyond the benchmark numbers:

Proof of concept for RL-driven reasoning: Before this, it was unclear whether RL alone could produce the sophisticated chain-of-thought behaviors seen in o1. R1-Zero confirms it unambiguously.
Open source at frontier capability: The model weights and (importantly) the training recipe are public. This enables the research community to study and extend RL-for-reasoning at scale.
Distillation as a capability transfer mechanism: The distilled small models (1.5B–70B) are competitive with much larger non-reasoning models. This opens the possibility of deploying strong reasoning capability at low inference cost.
GRPO as a practical PPO alternative: The algorithm is simple, memory-efficient, and effective. It has been widely adopted in subsequent work (DAPO, Dr. GRPO, etc.).
Framework for verifiable tasks: The paper articulates clearly when RL works well (verifiable rewards, sufficient model capacity, challenging but not impossible problems) and when it doesn’t (open-ended tasks, small models, overly easy problems). This is a practical roadmap.

Reproducing Key Results

What would it take to reproduce DeepSeek-R1?

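Base model: DeepSeek-V3-Base is open-source on HuggingFace. At 671B parameters it needs roughly 1.3 TB in bf16 (2 bytes per parameter); a footprint in the ~640 GB range corresponds to 8-bit weights.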
RL infrastructure: vLLM + custom GRPO training loop. OpenRLHF, verl, and TRL all provide GRPO implementations now.
Compute: The full run uses several thousand GPU-days on H800s. Stage 2 RL alone runs 10,400 steps with 512 samples per step × 30K tokens ≈ 1.5 × 10^7 tokens processed per step.
Reward functions: The math reward (checking final answer) and code reward (compiling + test case evaluation) are straightforward. Math datasets (MATH, AMC, AIME, Olympiad) are public.
Cold-start data: “Thousands of” examples — the paper doesn’t release this exact data but community reproductions (DeepScaleR, STILL-3, etc.) have built similar datasets.

Several community reproductions (NovaSky’s Sky-T1, Eurus-2, etc.) have partially reproduced DeepSeek-R1’s performance on smaller models with smaller compute budgets, validating the core training recipe.

Summary

DeepSeek-R1 advances the field on three levels simultaneously:

Algorithm: GRPO — a simpler, more memory-efficient alternative to PPO that works well for long-CoT RL
Training recipe: A 4-stage pipeline that combines cold-start SFT, staged RL, and rejection-sampling SFT to produce a general-purpose reasoning model
Systems insight: Distillation via SFT on model-generated traces is more efficient than RL on small models for capability transfer

The deepest contribution is the demonstration that reward-based RL can discover genuinely novel reasoning strategies — not mimicry of human demonstrations but behaviors that emerge from optimization pressure. The “aha moment” phenomenon and the organic growth of response length are empirical evidence that RL is doing something new, not just refining what SFT already taught.

The paper also sets honest boundaries: reward hacking, open-ended tasks, multilingual support, and prompt sensitivity are all identified as limitations. These are the roadmap for R2.

Appendix A: Detailed Mathematical Derivations

A.1 Deriving the GRPO Advantage Estimator

Let me show why the group mean is a sound baseline, and in what sense the resulting advantage estimate is unbiased.

In standard RL, the advantage $A(s, a)$ is defined as:

A(s, a) = Q(s, a) - V(s)

where $Q(s, a)$ is the action-value function (expected return from taking action $a$ in state $s$) and $V(s) = \mathbb{E}_a[Q(s, a)]$ is the state-value function (baseline).

In the LLM setting with outcome rewards, we can treat the problem as a one-step bandit: the “state” is the question $q$ and the “action” is the entire output $o_i$ (since reward is only observed at the end). So:

A_i = R(o_i | q) - B(q)

where $B(q)$ is a baseline that depends only on the question (not the specific output). We need $B(q)$ to satisfy $\mathbb{E}[B(q)] = \mathbb{E}[R(o_i|q)]$ so that the advantage is zero-mean in expectation.

GRPO’s choice: $B(q) = \frac{1}{G}\sum_{i=1}^G r_i$ — the empirical mean within the group.

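Its expectation matches the expected reward:

\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^G r_i\right] = \mathbb{E}[r_i] = V(q)

where $V(q) = \mathbb{E}_{o \sim \pi_\theta}[R(o|q)]$ is the expected reward for question $q$, i.e. the state value from the definition above. One caveat: the group mean includes $r_i$ itself, so $B(q)$ is not strictly independent of $o_i$. Algebraically, $r_i - \frac{1}{G}\sum_j r_j = \frac{G-1}{G}\left(r_i - \frac{1}{G-1}\sum_{j \neq i} r_j\right)$, so the mean-centered advantage equals the leave-one-out advantage (whose baseline is independent of $o_i$) scaled by $(G-1)/G$. The gradient estimate is therefore unbiased up to a constant factor that is close to 1 for $G=16$.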

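The division by $\text{std}(\{r_1, \ldots, r_G\})$ is scale normalization. Within one question's group it leaves the signs and ordering of the advantages unchanged and only rescales them, so questions with very different reward spreads contribute updates of comparable size. It does, however, reweight questions relative to each other, which is one motivation for variants such as Dr. GRPO that drop this term.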

Variance analysis: The group mean estimator has variance $\text{Var}[r_i] / G$ . With $G=16$ , this reduces variance by 16× compared to using a single sample, making GRPO significantly more stable than naive REINFORCE.
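The group-normalized advantage from A.1 is easy to compute directly. The sketch below uses illustrative function names, and the choice of the population standard deviation is an assumption, since the text does not say which variant is meant. It also lets you check numerically that mean-centering with the sample's own reward included equals a leave-one-out baseline scaled by $(G-1)/G$.

```python
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r_i - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


def centered(rewards):
    """Advantages with the group mean as baseline (own sample included)."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]


def leave_one_out(rewards):
    """Advantages whose baseline is the mean of the other G-1 rewards."""
    G = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (G - 1) for r in rewards]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]  # hypothetical 0/1 outcomes
    G = len(rewards)
    for c, l in zip(centered(rewards), leave_one_out(rewards)):
        # centered == (G-1)/G * leave-one-out, up to float rounding
        print(round(c, 6), round((G - 1) / G * l, 6))
    print(grpo_advantages(rewards))
```

When every reward in the group is identical, `grpo_advantages` returns all zeros, which is the degenerate case the `eps` term guards against.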

A.2 The KL Divergence Estimator’s Non-Negativity Proof

GRPO uses the estimator:

\hat{D}_\text{KL} = \frac{\pi_\text{ref}(o|q)}{\pi_\theta(o|q)} - \log \frac{\pi_\text{ref}(o|q)}{\pi_\theta(o|q)} - 1

Let $x = \frac{\pi_\text{ref}(o|q)}{\pi_\theta(o|q)}$ . Then $\hat{D}_\text{KL} = x - \log x - 1$ .

To show this is $\geq 0$ : let $f(x) = x - \log x - 1$ . Then $f'(x) = 1 - \frac{1}{x}$ , so $f'(x) = 0 \Leftrightarrow x = 1$ .

$f''(x) = \frac{1}{x^2} > 0$ for all $x > 0$ , so $x=1$ is a global minimum. $f(1) = 1 - 0 - 1 = 0$ .

Therefore $\hat{D}_\text{KL} \geq 0$ for every sample, with equality iff $x = 1$, i.e. $\pi_\theta(o|q) = \pi_\text{ref}(o|q)$ for that output. ∎

Why this estimator, not the standard $\mathbb{E}[\log(\pi_\theta / \pi_\text{ref})]$ ?

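Both are estimated from the same samples $o_i \sim \pi_\theta$ and need the same two log-probabilities, so their cost is identical. The difference is statistical. The naive per-sample term $\log \pi_\theta(o_i|q) - \log \pi_\text{ref}(o_i|q)$ is unbiased, but it can be negative for individual samples and tends to have high variance. The Schulman estimator is also unbiased, because $\mathbb{E}_{o \sim \pi_\theta}[x - 1] = 0$, yet every per-sample term is non-negative and its variance is typically lower when the two policies are close, which makes it a better-behaved penalty.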
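The properties of the estimator in A.2 can be checked exactly on a toy example. The two three-outcome distributions below are made-up numbers, and the names `k1` and `k3` follow the informal labels commonly used for the naive log-ratio estimator and the $x - \log x - 1$ estimator. Because the outcome space is tiny, the expectations are computed by enumeration rather than by sampling.

```python
import math

pi_theta = [0.5, 0.3, 0.2]  # hypothetical current policy over 3 outputs
pi_ref = [0.4, 0.4, 0.2]    # hypothetical reference policy


def k1(p_theta, p_ref):
    """Naive per-sample term: log(pi_theta / pi_ref). Can be negative."""
    return math.log(p_theta / p_ref)


def k3(p_theta, p_ref):
    """Per-sample term x - log(x) - 1 with x = pi_ref / pi_theta. Never negative."""
    x = p_ref / p_theta
    return x - math.log(x) - 1.0


pairs = list(zip(pi_theta, pi_ref))

# Exact KL(pi_theta || pi_ref) and the exact expectation of each estimator
# under samples drawn from pi_theta.
exact_kl = sum(p * math.log(p / q) for p, q in pairs)
mean_k1 = sum(p * k1(p, q) for p, q in pairs)
mean_k3 = sum(p * k3(p, q) for p, q in pairs)

k1_terms = [k1(p, q) for p, q in pairs]
k3_terms = [k3(p, q) for p, q in pairs]

if __name__ == "__main__":
    print("exact KL:", exact_kl)
    print("E[k1]:", mean_k1, "per-sample:", k1_terms)
    print("E[k3]:", mean_k3, "per-sample:", k3_terms)
```

Both expectations agree with the exact KL, but only the `k3` terms are non-negative for every individual output.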

A.3 PPO vs GRPO: The Value Function Problem for Long CoT

Why is GAE particularly bad for long reasoning chains? Let’s trace through an example.

Consider a math problem. The correct solution requires:

Tokens 1–500: Setting up the problem correctly
Tokens 501–2,000: Attempting a first approach
Tokens 2,001–2,500: Realizing the approach is wrong (“Wait…”)
Tokens 2,501–8,000: Correct approach leading to answer

The value function at token 1 must predict the expected reward at token 8,000. But the reward depends on whether the model will eventually “realize” its mistake at token 2,000 and recover. Early in training, this prediction is close to random.

GAE requires:

V_\phi(s_t) \approx \mathbb{E}\left[\sum_{k=t}^T \gamma^{k-t} r_k\right]

For $t = 1$ (first token), this requires predicting whether 8,000 tokens later the model will have gotten the answer right. The gradient signal for training $V_\phi$ at early positions is extremely noisy.

GRPO bypasses this entirely — the group mean gives a direct signal about whether this particular question tends to be solved correctly, without any position-specific prediction.
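To make the contrast concrete: with outcome rewards, the group-normalized advantage of an output is simply shared by every token in that output, so no per-position value prediction is involved. The helper below is a sketch with a hypothetical name, and it again assumes the population standard deviation.

```python
import statistics


def token_level_advantages(rewards, lengths, eps=1e-8):
    """Broadcast each output's group-normalized advantage to all of its tokens.

    rewards: one scalar outcome reward per sampled output in the group
    lengths: number of tokens in each corresponding output
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [
        [(r - mean) / (std + eps)] * n  # same value at every token position
        for r, n in zip(rewards, lengths)
    ]


if __name__ == "__main__":
    # Two hypothetical outputs: a correct one (3 tokens) and a wrong one (5 tokens).
    for row in token_level_advantages([1.0, 0.0], [3, 5]):
        print(row)
```

Token 1 and token 8,000 of the same output receive identical credit. That is coarse, but it avoids training a critic whose early-position targets are dominated by noise.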

Appendix B: GRPO Algorithm Full Pseudocode

Algorithm: GRPO Training for DeepSeek-R1-Zero

Hyperparameters:
  G = 16          # group size
  ε = 10          # clip ratio
  β = 0.001       # KL coefficient
  lr = 3e-6       # learning rate
  T_ref = 400     # reference model refresh interval
  T_max = 10400   # total training steps
  n_rollout = 8192  # rollout batch size
  n_minibatch = 16  # mini-batches per rollout

Initialize:
  π_θ ← DeepSeek-V3-Base
  π_ref ← DeepSeek-V3-Base  (frozen initially)
  π_θ_old ← copy of π_θ

t = 0
While t < T_max:

  # === ROLLOUT PHASE ===
  # n_rollout counts outputs, so this samples n_rollout / G = 512 questions
  Sample n_rollout / G questions Q from training set

  For each q in Q:
    Sample G outputs {o_1,...,o_G} from π_θ_old(·|q) with temperature=1
    Score each: r_i = reward(o_i, q)   # rule-based: check answer + format

    # Compute group-normalized advantages while the group is intact
    r_mean = mean({r_i : i=1,...,G})
    r_std  = std({r_i : i=1,...,G}) + 1e-8  # epsilon for stability
    A_i = (r_i - r_mean) / r_std  for all i

  Pack all (q, o_i, A_i) into dataset D_rollout
  Randomly split D_rollout into n_minibatch mini-batches  # 512 outputs each

  # === TRAINING PHASE (single inner epoch, one step per mini-batch) ===
  For each mini-batch B ⊆ D_rollout:

    loss = 0
    For each (q, o_i, A_i) in B:

      # Clipped surrogate term
      ratio_i = π_θ(o_i|q) / π_θ_old(o_i|q)
      clipped = clip(ratio_i, 1-ε, 1+ε)
      L_i = min(ratio_i * A_i, clipped * A_i)

      # KL penalty term
      x_i  = π_ref(o_i|q) / π_θ(o_i|q)
      KL_i = x_i - log(x_i) - 1

      # Negative because we maximize
      loss += -(L_i - β * KL_i)
    loss /= |B|

    Compute gradients of loss, update π_θ
    t += 1

    # === REFERENCE MODEL UPDATE ===
    If t % T_ref == 0:
      π_ref ← copy of π_θ  # refresh reference to prevent KL from over-constraining

  # Update π_θ_old for next rollout
  π_θ_old ← copy of π_θ

Notes on the pseudocode:

Line “If t % T_ref == 0: π_ref ← copy of π_θ” is crucial. Without it, the policy drifts far from the initial base model over 10,000 steps, so the KL term keeps growing and increasingly penalizes exactly the changes RL is meant to produce. Refreshing every 400 steps measures drift only against a recent checkpoint, which keeps the KL penalty a meaningful local regularizer.
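The + 1e-8 in std computation prevents division by zero when all G outputs receive the same reward (e.g., all correct or all incorrect). In that case every advantage in the group is 0, so the question contributes no reward-driven gradient for that step.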
The single inner epoch avoids overfitting on the rollout batch.
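For readers who want something executable, here is a minimal sketch of the per-group loss from the pseudocode, written over sequence-level log-probabilities. The function name and argument layout are assumptions for illustration. It mirrors the sequence-level form used in the pseudocode, whereas practical implementations typically apply the same terms per token and run on tensors rather than Python lists.

```python
import math


def grpo_group_loss(logp_new, logp_old, logp_ref, advantages, clip_eps, beta):
    """Loss for one group of G outputs sampled for the same question.

    logp_new / logp_old / logp_ref: log-probability of each output under the
    current policy, the rollout policy, and the reference policy.
    advantages: group-normalized advantage A_i of each output.
    """
    G = len(advantages)
    surrogate = 0.0
    kl = 0.0
    for ln, lo, lr, a in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(ln - lo)                        # pi_theta / pi_theta_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate += min(ratio * a, clipped * a)         # pessimistic PPO-style bound
        x = math.exp(lr - ln)                            # pi_ref / pi_theta
        kl += x - (lr - ln) - 1.0                        # x - log(x) - 1, always >= 0
    # Negated because optimizers minimize while the objective is maximized.
    return -(surrogate / G - beta * kl / G)


if __name__ == "__main__":
    # Hypothetical numbers: two outputs, policy unchanged since rollout.
    print(grpo_group_loss([-5.0, -7.0], [-5.0, -7.0], [-5.0, -7.0],
                          [1.0, -0.5], clip_eps=0.2, beta=0.001))
```

With identical new, old, and reference log-probabilities the ratio is 1 and the KL term is 0, so the loss reduces to the negative mean advantage. Passing `clip_eps=10` as in the hyperparameter table makes the lower clip bound negative, so in practice only the upper bound can bind.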

C.1 How This Differs from InstructGPT / RLHF

Classic RLHF (Ouyang et al., 2022 — InstructGPT):

SFT on human demonstrations
Train RM on human preference comparisons
PPO to maximize RM score

Key differences in DeepSeek-R1:

No human preference comparisons: The reward is rule-based (is the answer correct?), not trained from human feedback.
No SFT before RL (R1-Zero): Classic RLHF always starts with SFT; R1-Zero skips this.
Outcome reward, not process reward: The RL signal comes only from the final answer correctness, not from intermediate steps.
Scale: Classic RLHF operated on models up to ~175B; R1 operates at 671B with a much longer context window.

C.2 Process Reward Models (PRMs) — The Road Not Taken

An alternative to outcome rewards (ORM) is process reward models that score each reasoning step. OpenAI’s “Let’s Verify Step by Step” (Lightman et al., 2023) showed that step-level feedback can improve math performance on the MATH benchmark.

Why didn’t DeepSeek-R1 use PRMs?

PRMs require human annotation at the step level, which is expensive and hard to scale.
PRMs can be “fooled” by correct-looking but incorrect intermediate steps.
For the specific problem domains (competition math, code), outcome verification is cheap and reliable, making PRMs unnecessary.

The paper’s insight: for tasks with verifiable outcomes, outcome-level rewards are sufficient to develop sophisticated multi-step reasoning. You don’t need to tell the model which intermediate steps are correct.

C.3 Monte Carlo Tree Search (MCTS) — Another Road Not Taken

Inspired by AlphaGo and AlphaZero, a line of prior work applies MCTS to search over reasoning paths at test time. The DeepSeek-R1 paper discusses MCTS among its unsuccessful attempts at scaling test-time compute.

Why not MCTS?

MCTS requires a learned value function — the same problem as PPO.
MCTS is not end-to-end differentiable; it requires a separate inference-time search procedure.
R1’s approach (generate a long CoT in a single pass) is simpler to implement and deploy.

The paper claims R1’s “dynamic length allocation” within a single pass is competitive with or better than MCTS for the benchmark tasks evaluated, though a comprehensive comparison is not provided.

C.4 Concurrent Work: OpenAI o1

The paper acknowledges OpenAI o1 as the direct comparison point. However, o1’s training details are not public. Based on available information:

o1 likely uses a multi-stage pipeline similar to DeepSeek-R1
o1 reportedly uses process reward models in training (unconfirmed)
o1 is closed-source; DeepSeek-R1 is fully open-source

The significance of DeepSeek-R1 is not just matching o1’s performance but providing the training recipe so the research community can build on it.

Appendix D: Figures Reference Guide

This review contains the following embedded diagrams:

Figure 1: GRPO vs PPO architecture comparison (ASCII)
Figure 2: R1-Zero training trajectory — accuracy and response length curves (ASCII)
Figure 3: DeepSeek-R1 four-stage training pipeline (Mermaid)
Figure 4: Distillation pipeline flowchart (ASCII)
Figure 5: Distilled model performance table (Markdown)
Figure 6: Main benchmark comparison table (Markdown)
Figure 7: RL training infrastructure dataflow (Mermaid)

The paper’s key experimental figures (Figure 1 in the paper: AIME accuracy curve and response length curve) show the most important empirical result: reasoning accuracy and response length both trend steadily upward during RL training, which is consistent with capability emerging from optimization rather than from imitation of demonstrations.