June 6, 2026 #Reinforcement Learning #Reasoning #LLM Training

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Review date: 2026-06-06
Review author: Zhongzhu Zhou
Paper reviewed: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Paper authors: DeepSeek-AI (Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, et al.)
arXiv: 2501.12948
Status / Venue: arXiv 2025-01; published in Nature 2025

Short Answer

DeepSeek-R1 shows that chain-of-thought reasoning, including self-reflection, verification, and dynamic strategy adaptation, can emerge spontaneously from pure outcome-based reinforcement learning on a pre-trained LLM, without any human-labeled reasoning traces. The resulting model rivals OpenAI o1 on math olympiad and competitive programming benchmarks, and its reasoning patterns can be distilled into models as small as 1.5B parameters.

Prerequisites

Before diving into the paper, you need to be comfortable with these ideas:

Transformer language models. A pre-trained LLM like DeepSeek-V3-Base produces a probability distribution over the next token given a prefix. The model is parameterized by weights $\theta$, and the forward pass computes $\pi_\theta(o | q)$, the probability of output sequence $o$ given question $q$.

Chain-of-thought (CoT) prompting. Rather than predicting the final answer directly, the model generates intermediate reasoning steps. This has been shown (Wei et al., 2022) to dramatically improve performance on multi-step problems.

Supervised Fine-Tuning (SFT). The standard post-training recipe. Given a dataset of (question, correct-answer) pairs, minimize cross-entropy loss. Effective but limited by the quality and quantity of human annotations.

Reinforcement Learning for LLMs. The model is treated as a policy $\pi_\theta$. A reward signal evaluates its output, and the policy is updated to maximize expected reward. The key challenge: the reward is only available after generating a complete response (sparse reward), and the action space is the vocabulary (tens of thousands of tokens).

PPO (Proximal Policy Optimization). The dominant RL algorithm for LLMs. PPO clips the policy-ratio to keep updates stable, and uses a separately trained value model (critic) to compute advantages via Generalized Advantage Estimation (GAE).

KL divergence. A measure of how much the new policy $\pi_\theta$ diverges from the reference policy $\pi_\text{ref}$. Used to regularize RL training and prevent the model from drifting too far from the pre-trained distribution.

RLHF (Reinforcement Learning from Human Feedback). The standard recipe for aligning LLMs: train a reward model from human preferences, then use PPO to maximize that reward. InstructGPT and ChatGPT were trained this way.

Rejection Sampling. Generate many candidate outputs, keep only those that pass some filter (e.g., final answer is correct), and use those as SFT data. A simple but effective self-improvement technique.
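The rejection-sampling idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: `rejection_sample`, `toy_generate`, and `exact_match` are hypothetical names, and the "model" is a toy arithmetic function that is deliberately wrong on every other attempt.

```python
from itertools import cycle
from typing import Callable, List, Tuple


def rejection_sample(
    questions: List[Tuple[str, str]],
    generate: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    num_candidates: int = 16,
    max_keep: int = 3,
) -> List[Tuple[str, str]]:
    """Return (question, response) pairs whose response passes the filter."""
    sft_data: List[Tuple[str, str]] = []
    for question, reference in questions:
        kept: List[str] = []
        for _ in range(num_candidates):
            response = generate(question)
            # Keep only correct responses, skipping exact duplicates.
            if is_correct(response, reference) and response not in kept:
                kept.append(response)
            if len(kept) == max_keep:
                break
        sft_data.extend((question, r) for r in kept)
    return sft_data


# Toy stand-in for a model (assumption): every other attempt is off by one.
_offsets = cycle([1, 0])


def toy_generate(question: str) -> str:
    a, b = (int(x) for x in question.split("+"))
    return str(a + b + next(_offsets))


def exact_match(response: str, reference: str) -> bool:
    return response.strip() == reference.strip()


sft_data = rejection_sample([("2+3", "5"), ("10+7", "17")], toy_generate, exact_match)
print(sft_data)
```

Only the filtered pairs survive, so the resulting SFT set contains no wrong answers even though half of the raw samples were incorrect.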

1. Background and Motivation

The history of reasoning in LLMs tells a consistent story: more human supervision leads to better reasoning. Chain-of-thought prompting needed carefully curated few-shot examples. OpenAI’s o1 model reportedly required enormous investments in human-labeled reasoning trajectories. The dominant assumption was that you cannot get reasoning without showing the model how to reason.

DeepSeek-R1 challenges this assumption head-on. The core insight is deceptively simple: if you give a model a hard enough question, a reliable verifier, and enough compute for RL, reasoning will emerge on its own.

The paper makes three contributions:

DeepSeek-R1-Zero: A model trained with zero SFT, zero human reasoning traces, pure RL only. It develops self-reflection and verification organically.
DeepSeek-R1: A production-quality model that fixes R1-Zero’s readability and language-mixing issues through a carefully designed multi-stage training pipeline.
DeepSeek-R1-Distill series: Six open-source models (1.5B to 70B) distilled from DeepSeek-R1, showing reasoning transfers to small models.

The key enablers were: (a) a powerful base model (DeepSeek-V3-Base, 671B total parameters, MoE architecture with 37B active per token), (b) GRPO, a value-model-free RL algorithm that makes large-scale training tractable, and (c) rule-based verifiers for math and code that provide reliable, hard-to-hack rewards.

2. GRPO: The RL Engine Behind DeepSeek-R1

2.1 Why not just use PPO?

PPO works as follows: generate an output, estimate its advantage using a value model (which predicts cumulative reward from partial outputs), and update the policy using a clipped objective. The problem with PPO for LLM reasoning training is three-fold:

Value model overhead. The value model is typically the same size as the policy model. For a 671B model, this means roughly doubling GPU memory and compute requirements.
Value prediction is hard for long CoT. The value model must predict the eventual outcome reward from partial outputs. But in reasoning chains with hundreds to thousands of tokens, the model frequently revises earlier statements (“wait, actually…”). This makes predicting final outcomes from intermediate states nearly intractable.
Per-token KL penalty implicitly penalizes length. PPO adds KL regularization as a dense per-token reward. Since RL maximizes cumulative reward, longer responses accumulate more KL penalty, which implicitly discourages the model from generating long chains of thought — the opposite of what you want.

2.2 GRPO: Group Relative Policy Optimization

GRPO (Shao et al., 2024) eliminates the value model entirely. Instead of estimating advantages from a learned critic, GRPO estimates them from the relative performance within a group of sampled outputs.

Algorithm 1: GRPO Training Loop

Input: Base model π_θ_old, training questions Q, reward function R(o, q)
Hyperparameters: group size G=16, clip ratio ε=10, KL coeff β=0.001

for each training step:
    1. Sample batch of 32 unique questions {q_1, ..., q_32} from Q
    2. For each question q:
       a. Sample G=16 outputs {o_1, ..., o_G} from current policy π_θ_old
       b. Compute rewards {r_1, ..., r_G} using rule-based verifier R
       c. Compute group-normalized advantages:
              A_i = (r_i - mean({r_j})) / std({r_j})
    3. Update policy by maximizing GRPO objective J_GRPO(θ)
    4. Every 400 steps: replace π_ref with current policy π_θ

The core GRPO objective is:

\mathcal{J}_\text{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^G \sim \pi_{\theta_\text{old}}(O|q)} \left[ \frac{1}{G} \sum_{i=1}^G \left( \min\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_\text{old}}(o_i|q)} A_i,\; \text{clip}\!\left(\frac{\pi_\theta(o_i|q)}{\pi_{\theta_\text{old}}(o_i|q)}, 1-\varepsilon, 1+\varepsilon\right) A_i \right) - \beta\, \mathbb{D}_\text{KL}(\pi_\theta \| \pi_\text{ref}) \right) \right] \tag{1}
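To make the min/clip structure of Eq. 1 concrete, here is a small sketch of the clipped surrogate term for a single sample, leaving out the KL term and the group average. The function name `clipped_term` is illustrative, and the probability ratio is passed in directly rather than computed from a model.

```python
def clipped_term(ratio: float, advantage: float, eps: float) -> float:
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one sample."""
    unclipped = ratio * advantage
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(unclipped, clipped_ratio * advantage)


# Positive advantage, tight clip: the gain is capped at (1 + eps) * A.
tight_gain = clipped_term(ratio=1.5, advantage=1.0, eps=0.2)

# Negative advantage: the min keeps the more pessimistic (unclipped) value.
tight_loss = clipped_term(ratio=1.5, advantage=-1.0, eps=0.2)

# Very loose clip (eps = 10): the same ratio passes through unclipped.
loose_gain = clipped_term(ratio=1.5, advantage=1.0, eps=10.0)

print(tight_gain, tight_loss, loose_gain)
```

With ε = 10 the lower bound 1 − ε is negative, so for a positive probability ratio only the upper bound of 11 can ever bind.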

The KL term uses an unbiased estimator (not a per-token approximation):

\mathbb{D}_\text{KL}(\pi_\theta \| \pi_\text{ref}) = \frac{\pi_\text{ref}(o_i|q)}{\pi_\theta(o_i|q)} - \log \frac{\pi_\text{ref}(o_i|q)}{\pi_\theta(o_i|q)} - 1 \tag{2}
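A quick numerical check of Eq. 2 may help. The sketch below evaluates the estimator from log-probabilities and confirms, on a tiny three-outcome distribution standing in for the output space (an assumption made purely for illustration), that its expectation under $\pi_\theta$ equals the exact KL divergence. The names `kl_estimate`, `p`, and `q` are hypothetical.

```python
import math


def kl_estimate(logp_theta: float, logp_ref: float) -> float:
    """Per-sample estimate from Eq. 2: r - log(r) - 1 with r = pi_ref / pi_theta."""
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0


p = [0.7, 0.2, 0.1]  # toy pi_theta over three possible outputs
q = [0.5, 0.3, 0.2]  # toy pi_ref over the same outputs

# Exact KL(pi_theta || pi_ref).
exact_kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Expectation of the estimator when outputs are drawn from pi_theta.
expected_estimate = sum(
    pi * kl_estimate(math.log(pi), math.log(qi)) for pi, qi in zip(p, q)
)

print(exact_kl, expected_estimate)
```

Each per-sample value is non-negative, and it is exactly zero when the two policies assign the same probability to the sampled output.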

The advantage is the normalized group reward:

A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \ldots, r_G\})}{\text{std}(\{r_1, r_2, \ldots, r_G\})} \tag{3}

2.3 Why does this work? Intuition behind group normalization

Think about a student trying to improve at math. Rather than asking a teacher to score each solution on an absolute scale (which is what a learned value model approximates), the student compares their multiple draft solutions to each other: “which of my attempts was best?” This relative ranking is cheap to compute and surprisingly informative.

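If every output in a group receives the same reward (for example, all G are wrong and get 0), the numerator of Eq. 3 is zero for every $A_i$, so the group contributes no gradient and the policy doesn’t update on that question. The standard deviation is also zero in that case, so an implementation has to treat the resulting 0/0 as a zero advantage. If one output gets reward 1 and the rest get 0, then that output gets a large positive advantage and the others get a small negative advantage. The policy learns to make outputs more like the successful one. This is exactly the right inductive bias for reasoning tasks where most attempts fail.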
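The one-success-in-a-group case can be worked through numerically. The sketch below implements Eq. 3 under two assumptions that the equation itself does not pin down: it uses the population standard deviation, and it returns zero advantages when all rewards in the group are identical. `group_advantages` is an illustrative name.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-normalized advantages: (r_i - mean) / std over one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std (assumption)
    if sigma < eps:
        # All rewards identical: no learning signal from this group.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


# One correct answer out of G = 16 samples.
adv = group_advantages([1.0] + [0.0] * 15)
print(round(adv[0], 3), round(adv[1], 3))  # sqrt(15) and -1/sqrt(15)

# All wrong: every advantage is zero.
print(group_advantages([0.0] * 16))
```

With G = 16 the single correct sample gets an advantage of √15 ≈ 3.87, while each failed sample gets −1/√15 ≈ −0.26, so the update is dominated by the rare success.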

Why GRPO’s KL estimator beats PPO’s per-token penalty for long CoT:

In PPO, the per-token KL penalty contributes to every token’s reward. A response of length $L$ accumulates roughly $L \times \beta_\text{PPO} \times \text{KL}$ worth of penalty. Because the total penalty grows with $L$, and reasoning chains get longer as the model thinks more, PPO implicitly penalizes the model for thinking more. GRPO’s KL is computed once per sequence and subtracted from the objective, so it doesn’t scale with length. This allows the model to freely increase response length during training.

Figure 1 compares PPO and GRPO architecturally:

graph LR
    subgraph PPO ["PPO (Requires Value Model)"]
        A1[Question q] --> B1[Policy Model]
        B1 --> C1[Output o]
        C1 --> D1[Reward Model → r]
        D1 --> E1[Value Model → v]
        E1 --> F1[GAE → Advantage]
        F1 --> G1[Clip + Update]
        A1 --> H1[Reference Model → KL per token]
        H1 --> G1
    end

    subgraph GRPO ["GRPO (No Value Model)"]
        A2[Question q] --> B2[Policy Model]
        B2 --> C2["G=16 Outputs {o_1...o_G}"]
        C2 --> D2["Rewards {r_1...r_G}"]
        D2 --> E2["Group Normalization → {A_i}"]
        E2 --> F2[Clip + KL + Update]
        A2 --> G2[Reference Model → KL sequence]
        G2 --> F2
    end

    style GRPO fill:#e8f5e9,stroke:#43a047
    style PPO fill:#fff3e0,stroke:#fb8c00

The key savings: GRPO eliminates the value model (saving ~50% memory and compute), avoids GAE hyperparameter sensitivity (PPO is highly sensitive to λ in GAE), and places KL regularization at the sequence level (avoiding length penalty).

3. DeepSeek-R1-Zero: Reasoning from Pure RL

3.1 Setup

DeepSeek-R1-Zero starts from DeepSeek-V3-Base and applies GRPO with zero SFT. The training prompt is minimal:

“A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags.”

No example reasoning trajectories. No templates showing how to reflect or verify. Just: here’s a question, show your thinking, give your answer.

3.2 Reward Design

The reward combines two rule-based components:

R_\text{rule} = R_\text{acc} + R_\text{format} \tag{4}

Accuracy reward ($R_\text{acc}$): Binary (0 or 1). For math, the predicted answer is compared symbolically to the ground truth using pattern matching (the model is required to put the answer inside a box). For code, the solution is compiled and run against hidden test cases.

Format reward ($R_\text{format}$): Small positive reward for correctly enclosing reasoning inside <think>...</think> tags.
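A rule-based reward of this shape can be sketched with two regular expressions. Everything specific here is an assumption for illustration: the 0.1 format bonus, the plain string comparison standing in for symbolic answer matching, and the simple `\boxed{...}` pattern that does not handle nested braces. The function names are hypothetical.

```python
import re

# Response must be <think>...</think> followed by <answer>...</answer>.
THINK_ANSWER = re.compile(r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL)
# Simplified: matches \boxed{...} without nested braces.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def format_reward(response: str, bonus: float = 0.1) -> float:
    """Small bonus (value assumed) when the tag structure is respected."""
    return bonus if THINK_ANSWER.match(response.strip()) else 0.0


def accuracy_reward(response: str, reference: str) -> float:
    """1.0 if the last boxed answer equals the reference string, else 0.0."""
    matches = BOXED.findall(response)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == reference.strip() else 0.0


def rule_reward(response: str, reference: str) -> float:
    """Eq. 4: accuracy reward plus format reward."""
    return accuracy_reward(response, reference) + format_reward(response)


good = "<think>2+2 is 4.</think> <answer>\\boxed{4}</answer>"
no_tags = "The answer is \\boxed{4}"
wrong = "<think>guess</think><answer>\\boxed{5}</answer>"

print(rule_reward(good, "4"), rule_reward(no_tags, "4"), rule_reward(wrong, "4"))
```

A correct answer without the tags still earns the accuracy reward, while a well-formatted wrong answer earns only the small format bonus.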

The critical design choice: no neural reward model. Neural reward models are susceptible to reward hacking: the policy learns to produce outputs that score highly on the RM without actually being correct. Rule-based rewards are objective and deterministic, and far harder to exploit (a wrong final answer earns no accuracy reward, however persuasive the text around it).

RL data composition:

Math: 26K quantitative reasoning questions (algebra, calculus, Olympiads)
Code: 17K algorithm competition + 8K bug-fixing problems
STEM: 22K multiple-choice science questions
Logic: 15K deductive reasoning puzzles
General: 66K helpfulness/harmlessness questions

3.3 Emergent Reasoning Behaviors

The most striking finding of the paper: without any explicit instruction to reflect, verify, or explore alternatives, the model develops these behaviors organically.

Figure 2: AIME accuracy and response length during R1-Zero training

AIME 2024 Accuracy (Pass@1)        Average Response Length (tokens)
0.9 ┤                               20000 ┤
0.8 ┤                  ████          17500 ┤
0.7 ┤              ███                15000 ┤
0.6 ┤          ███                    12500 ┤
0.5 ┤       ██                        10000 ┤       ██████████
0.4 ┤     ██                           7500 ┤   ████
0.3 ┤   ██                             5000 ┤ ██
0.2 ┤ ██                               2500 ┤█
    0─────────────────── steps          0──────────────── steps
      0   2K  4K  6K  8K  10K            0  2K  4K  6K  8K  10K
   Initial AIME: 15.6%                 ~5K tokens initially
   Final: 77.9% (cons@16: 86.7%)      ~17K tokens at step 10K

Both curves show a sharp jump at step 8.2K. This corresponds to when the maximum output length was increased from 32,768 to 65,536 tokens. The model immediately took advantage of the longer context to think more deeply, and performance jumped accordingly: a striking demonstration that accuracy tracks the amount of test-time compute (thinking tokens) the model is allowed to spend.

The famous “aha moment” emerged around this same transition: the model began generating the word “Wait” mid-solution, then reconsidering its approach. Here is an actual excerpt from an intermediate checkpoint:

”…I could square both sides again, treating the equation… Wait, wait. Wait. That’s an aha moment I can flag here. Let’s reevaluate this step-by-step…”

This anthropomorphic self-correction pattern emerged without any training signal that rewards it directly. It arises because self-correction is correlated with correct final answers, and GRPO rewards correct final answers.

3.4 R1-Zero Results

On AIME 2024, DeepSeek-R1-Zero achieves 77.9% pass@1 (86.7% with self-consistency decoding over 16 outputs), surpassing the average performance of human math competition participants. This is comparable to OpenAI’s o1-mini.
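The two metrics quoted above differ in how they use the 16 samples. As a rough sketch, assuming pass@1 is estimated as the fraction of sampled answers that are correct and cons@16 as a majority vote over the same samples, they can be computed as follows. The function names and the toy answers are illustrative.

```python
from collections import Counter
from typing import List


def pass_at_1(answers: List[str], reference: str) -> float:
    """Fraction of sampled answers that match the reference."""
    return sum(a == reference for a in answers) / len(answers)


def cons_at_k(answers: List[str], reference: str) -> float:
    """1.0 if the most frequent answer matches the reference, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return 1.0 if majority == reference else 0.0


# Toy example: 16 samples for one problem whose correct answer is "204".
samples = ["204"] * 9 + ["210"] * 4 + ["96"] * 3

p1 = pass_at_1(samples, "204")
c16 = cons_at_k(samples, "204")
print(p1, c16)
```

Majority voting scores this problem as fully solved even though only 9 of 16 individual samples are right, which is why a consensus score sits above pass@1.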

However, R1-Zero has serious usability problems:

Language mixing: sometimes switches between Chinese and English mid-reasoning
Readability: the raw CoT is sometimes incoherent or repetitive
Narrow capability: pure reasoning RL doesn’t improve instruction following, writing, or open-domain QA

This motivates the four-stage pipeline for DeepSeek-R1.

4. DeepSeek-R1: The Full Multi-Stage Pipeline

flowchart TD
    A[DeepSeek-V3-Base] --> B["Stage 0: Cold Start SFT\n(thousands of long CoT examples\nhuman-aligned style)"]
    B --> C[R1-Dev1]
    C --> D["Stage 1: First RL\n(reasoning prompts + language\nconsistency reward)"]
    D --> E[R1-Dev2]
    E --> F["Stage 2: Rejection Sampling + SFT\n(800K samples: reasoning + general)"]
    F --> G[R1-Dev3]
    G --> H["Stage 3: Second RL\n(diverse prompts: reasoning + general\nrule + model-based reward)"]
    H --> I[DeepSeek-R1]

    J[DeepSeek-R1] --> K["Distillation (800K samples SFT)"]
    K --> L[Qwen-1.5B / 7B / 14B / 32B]
    K --> M[Llama-8B / 70B]

    style A fill:#e3f2fd
    style I fill:#e8f5e9,stroke:#43a047,stroke-width:2px
    style L fill:#fff9c4
    style M fill:#fff9c4

4.1 Stage 0: Cold Start — Teaching the Format

The problem with starting from pure R1-Zero’s style is readability. The fix: collect a small dataset (thousands of examples) of long CoT reasoning chains that have been manually rewritten into a conversational, first-person style.

The cold start data creation pipeline:

Generate multiple reasoning trajectories using R1-Zero at temperature 1.0
Filter to retain only those with correct final answers and acceptable formatting
Prompt DeepSeek-V3 to rewrite the accepted traces in a “natural, first-person, human-conversational style”
Human annotators verify the quality and language consistency

The key prompt for this rewriting step is:

Based on the above thought process, provide a clear, easy-to-follow, and
well-formatted solution. Show key steps in LaTeX. Use \boxed{} for final answers.
Do not add reasoning steps not in the original.

The model fine-tuned on this cold-start data becomes R1-Dev1, which has much better readability but slightly degraded math performance (because the cold start data is small and potentially suboptimal).

4.2 Stage 1: First RL — Reasoning with Language Consistency

The first RL stage trains R1-Dev1 with GRPO using the same rule-based rewards as R1-Zero, plus a new language consistency reward:

R_\text{language} = \frac{\text{Num}(\text{Words}_\text{target})}{\text{Num}(\text{Words})} \tag{5}

This is simply the fraction of words in the CoT that belong to the target language (Chinese or English, depending on the query). Adding this reward slightly degrades raw math performance (the language constraint reduces solution diversity), but makes the model far more readable.
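Eq. 5 can be approximated with a very crude word counter. The sketch below assumes only two languages, counts each run of Latin letters as one English word and each CJK character as one Chinese word, and returns 0.0 for text with no countable words. All of these are simplifications chosen for illustration, not the paper's tokenization.

```python
import re

CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
LATIN_WORD = re.compile(r"[A-Za-z]+")


def language_consistency(cot: str, target: str) -> float:
    """Fraction of counted words that belong to the target language ('en' or 'zh')."""
    english = LATIN_WORD.findall(cot)
    chinese = CJK_CHAR.findall(cot)  # one CJK character counted as one word
    total = len(english) + len(chinese)
    if total == 0:
        return 0.0
    in_target = len(english) if target == "en" else len(chinese)
    return in_target / total


mixed = "First compute x 然后 check"
print(language_consistency(mixed, "en"))  # 4 English words out of 6 counted
print(language_consistency("Solve for x", "en"))
```

A chain of thought that drifts into the other language mid-solution sees its reward fall in proportion to the share of off-target words.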

The combined first-stage reward is:

R_\text{Stage1} = R_\text{rule} + R_\text{language} = R_\text{acc} + R_\text{format} + R_\text{language} \tag{6}

Training hyperparameters: lr = 3e-6, KL coefficient β = 0.001, clip ratio ε = 10. The unusually large ε (vs. the typical 0.2 in PPO) is deliberate: a lower ε truncates gradients for many tokens, degrading performance. The trade-off is that a higher ε loosens the constraint on each update and can cause instability during training, so the value has to balance the two failure modes.

4.3 Stage 2: Rejection Sampling + Full SFT

After Stage 1, the model (R1-Dev2) is used to generate reasoning traces for a large-scale supervised training dataset. The process:

Algorithm 2: Rejection Sampling SFT Data Construction

Input: R1-Dev2 model, training questions Q_reasoning, Q_general

For math/code/STEM questions in Q_reasoning:
    1. Generate 16 outputs per question at temperature 1.0
    2. Filter: keep only correct answers (verified by rule-based judge)
    3. Keep: up to 3 distinct correct solutions per question
    4. Format cleanup: remove repetitions, fix language mixing

For general queries in Q_general:
    1. Generate responses using R1-Dev2
    2. Filter via DeepSeek-V3 preference model
    3. Keep top-ranked responses

Combined SFT dataset statistics (Table 5 from paper):
    Math:    395,285 samples, avg 6,094 tokens
    Code:    211,129 samples, avg 7,436 tokens
    STEM:     10,124 samples, avg 4,929 tokens
    Logic:    10,395 samples, avg 2,739 tokens
    General: 177,812 samples, avg 1,420 tokens
    TOTAL:   804,745 samples, avg 5,355 tokens

This is the most data-intensive stage: 800K samples averaging 5,355 tokens each ≈ 4.3 billion tokens of SFT data. Fine-tuning runs for 2-3 epochs with cosine decay learning rate (initial lr = 5×10⁻⁵, final = 5×10⁻⁶), max sequence length 32,768.
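The totals quoted above can be re-derived from the per-domain rows. The short check below uses only the sample counts and average lengths listed in the table; the dictionary name `sft_mix` is illustrative.

```python
# (samples, average tokens per sample) for each domain, as listed above.
sft_mix = {
    "Math": (395_285, 6_094),
    "Code": (211_129, 7_436),
    "STEM": (10_124, 4_929),
    "Logic": (10_395, 2_739),
    "General": (177_812, 1_420),
}

total_samples = sum(n for n, _ in sft_mix.values())
total_tokens = sum(n * avg for n, avg in sft_mix.values())
mean_tokens = total_tokens / total_samples

print(total_samples)               # matches the TOTAL row
print(round(mean_tokens))          # sample-weighted average length
print(round(total_tokens / 1e9, 2), "billion tokens")
```

Math and code together account for roughly three quarters of the samples and an even larger share of the tokens, since their traces are the longest.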

The resulting model (R1-Dev3) shows strong improvements in both reasoning and general capabilities compared to R1-Dev2.

4.4 Stage 3: Second RL — Helpfulness and Harmlessness

The final stage uses GRPO with mixed reward signals on diverse prompts:

R_\text{Stage3} = R_\text{reasoning} + R_\text{general} + R_\text{language} \tag{7}

R_\text{reasoning} = R_\text{rule} \quad\text{(for math/code/STEM)} \tag{8}

R_\text{general} = R_\text{RM\_helpful} + R_\text{RM\_safe} \tag{9}

The helpfulness reward model is trained on 66,000 preference pairs generated via DeepSeek-V3 with an arena-hard prompt format. The safety reward model is trained on 106,000 (safe/unsafe) labeled prompt-response pairs.

Reward hacking is real: the paper explicitly documents (Figure 6) that if you continue training with model-based rewards for too long, the policy learns to exploit weaknesses in the reward model (reward score rises while actual performance on CodeForces falls). Their mitigation: introduce general data and model-based rewards only in the final 400 steps of the 1,700-step second RL stage.

Second RL hyperparameters: temperature reduced to 0.7 (higher temperatures cause incoherent generation in this stage), otherwise same as Stage 1.

5. RL Infrastructure: Making 671B RL Feasible

Training a 671B parameter model with RL is an engineering challenge. The paper describes a four-module decoupled architecture:

flowchart LR
    subgraph Rollout ["🎲 Rollout Module"]
        R1[Load prompts]
        R2["vLLM workers\n(actor model)"]
        R3["8192 outputs\n→ 16 mini-batches"]
        R1 --> R2 --> R3
    end

    subgraph Inference ["🔍 Inference Module"]
        I1[Reward model\nforward pass]
        I2[Reference model\nKL computation]
    end

    subgraph RuleReward ["⚙️ Rule-Based Reward Module"]
        RR1[Code executor]
        RR2[Answer matcher]
        RR3[Format checker]
    end

    subgraph Training ["🏋️ Training Module"]
        T1[Actor model\n+ optional critic]
        T2[Compute loss\nupdate params]
    end

    Rollout -->|"outputs"| Inference
    Rollout -->|"outputs"| RuleReward
    Inference -->|"model rewards + KL"| Training
    RuleReward -.->|"async overlap"| Inference
    RuleReward -->|"rule rewards"| Training
    Training -->|"updated weights"| Rollout

    style Rollout fill:#e3f2fd
    style Training fill:#e8f5e9
    style RuleReward fill:#fff3e0

Key engineering decisions:

Expert parallelism for MoE: DeepSeek-V3-Base uses Mixture-of-Experts. During rollout, experts are parallelized across nodes to reduce memory access overhead; hotspot experts have redundant copies to balance compute.
Multi-Token Prediction (MTP) for self-speculative decoding: MTP predicts multiple future tokens simultaneously. During RL rollout, this acts as speculative decoding, dramatically accelerating the generation of the longest samples.
VRAM offloading between phases: Each module offloads model weights to system memory or disk after completing its phase, freeing VRAM for the next module. This enables running rollout (vLLM) and training on the same GPU cluster without doubling memory.
Data packing strategy: Sort all sequences by length, distribute across DP ranks, then use Best-Fit bin packing within each process to minimize padding. Ensures equal chunk counts across all processes for balanced training.
DualPipe algorithm: Efficient pipeline parallelism from DeepSeek-V3, allowing overlapped compute and communication.
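The packing step in the list above can be illustrated with a best-fit-decreasing bin packer. This sketch covers only the "sort by length, then best-fit" part within a single process. Distribution across DP ranks and the equal-chunk-count constraint are left out, and the function name and capacity value are assumptions.

```python
from typing import List, Optional


def best_fit_pack(lengths: List[int], capacity: int) -> List[List[int]]:
    """Pack sequence lengths into bins of size `capacity` using best-fit decreasing."""
    bins: List[List[int]] = []
    free: List[int] = []  # remaining space in each bin
    for length in sorted(lengths, reverse=True):
        if length > capacity:
            raise ValueError(f"sequence of length {length} exceeds capacity {capacity}")
        # Best fit: the bin with the least remaining space that still fits.
        best: Optional[int] = None
        for i, space in enumerate(free):
            if length <= space and (best is None or space < free[best]):
                best = i
        if best is None:
            bins.append([length])
            free.append(capacity - length)
        else:
            bins[best].append(length)
            free[best] -= length
    return bins


packed = best_fit_pack([700, 300, 500, 500, 200, 800], capacity=1000)
padding = sum(1000 - sum(b) for b in packed)
print(packed, padding)
```

In this toy case six sequences fit into three full bins with no padding, whereas padding each sequence to the capacity individually would waste half of the token slots.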

Training cost (Table 7):

DeepSeek-R1-Zero: 64 × 8 = 512 H800 GPUs for ~198 hours of wall-clock time, ~101,000 total GPU-hours ≈ $202K at $2/GPU-hr
DeepSeek-R1: Same GPU cluster, ~80 hours additional for RL stages (512 × 80 ≈ 41,000 GPU-hours) ≈ additional ~$82K at the same rate
SFT data creation: ~5,000 GPU-hours ≈ $10K

6. Distillation: Transferring Reasoning to Small Models

A surprising result: the complex reasoning patterns learned by DeepSeek-R1 can be transferred to much smaller models through simple SFT on the R1-generated 800K dataset.

Distilled model family (Table 6):

Distilled Model	Base Model	Init LR
DeepSeek-R1-Distill-Qwen-1.5B	Qwen2.5-Math-1.5B	1×10⁻⁴
DeepSeek-R1-Distill-Qwen-7B	Qwen2.5-Math-7B	8×10⁻⁵
DeepSeek-R1-Distill-Qwen-14B	Qwen2.5-14B	7×10⁻⁵
DeepSeek-R1-Distill-Qwen-32B	Qwen2.5-32B	6×10⁻⁵
DeepSeek-R1-Distill-Llama-8B	Llama-3.1-8B	5×10⁻⁵
DeepSeek-R1-Distill-Llama-70B	Llama-3.3-70B-Instruct	2×10⁻⁵

Each model is fine-tuned for 2-3 epochs, cosine LR decay to 1/10 of initial, max sequence length 32,768 tokens, batch size 64.
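The schedule described above, cosine decay down to one tenth of the initial learning rate, can be written as a small function. This is a sketch under assumptions: no warmup phase, a hypothetical total of 1,000 steps, and the 1×10⁻⁴ initial rate of the 1.5B model used as the example.

```python
import math


def cosine_lr(step: int, total_steps: int, lr_init: float, final_ratio: float = 0.1) -> float:
    """Cosine decay from lr_init to lr_init * final_ratio over total_steps."""
    lr_final = lr_init * final_ratio
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_final + 0.5 * (lr_init - lr_final) * (1.0 + math.cos(math.pi * progress))


total = 1_000  # hypothetical number of optimizer steps
lr_start = cosine_lr(0, total, lr_init=1e-4)
lr_mid = cosine_lr(total // 2, total, lr_init=1e-4)
lr_end = cosine_lr(total, total, lr_init=1e-4)
print(lr_start, lr_mid, lr_end)
```

The rate starts at the initial value, passes through the midpoint between the two endpoints halfway through training, and ends at one tenth of where it began.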

graph TB
    subgraph Teacher ["Teacher: DeepSeek-R1 (671B)"]
        T1["800K long-CoT SFT samples\n(math, code, STEM, logic, general)"]
    end
    subgraph Students ["Student Models (SFT only, no RL)"]
        S1["Qwen-1.5B\n(AIME: ~28%)"]
        S2["Qwen-7B\n(AIME: ~55%)"]
        S3["Qwen-14B\n(AIME: ~70%)"]
        S4["Qwen-32B\n(AIME: ~72%)"]
        S5["Llama-8B\n(AIME: ~50%)"]
        S6["Llama-70B\n(AIME: ~70%)"]
    end
    T1 --> S1
    T1 --> S2
    T1 --> S3
    T1 --> S4
    T1 --> S5
    T1 --> S6
    style Teacher fill:#e8f5e9,stroke:#43a047
    style Students fill:#e3f2fd

Why does this work? The 800K SFT dataset is dense with high-quality long reasoning chains — better reasoning data than these smaller base models have ever seen. The distilled models learn to imitate the reasoning style (long CoT, self-verification, step-by-step decomposition) even though they don’t go through any RL themselves.

The 1.5B distilled model still achieves ~28% on AIME 2024, which would have been considered frontier-level reasoning for a small model just one year prior.

7. Experiments and Key Results

7.1 Stage-by-Stage Ablation

The paper provides a detailed breakdown of how performance evolves across the four pipeline stages (Table 3). Key observations:

Cold start SFT hurts math but helps usability. Comparing R1-Zero (pure RL) to R1-Dev1 (cold start SFT only, before the first RL stage): AIME drops from 77.9% to 59.0% pass@1, but IF-Eval improves from 46.6% to 71.7% and AlpacaEval from 24.7% to 50.1%. Cold start data introduces a capability trade-off: better instruction following, worse raw math.

Reasoning RL (Stage 1) recovers math performance. R1-Dev2 recovers AIME to 74.0% and surpasses R1-Zero on LiveCodeBench (63.5% vs 50.0%). GPQA Diamond does not follow the same pattern: the 70.7% noted for R1-Dev2 sits below R1-Zero’s 75.8% (see Table 3 for the exact per-stage figures). The reasoning RL stage primarily benefits verifiable tasks while leaving general instruction following roughly unchanged.

Full SFT (Stage 2) substantially improves general capabilities. R1-Dev3 shows dramatic improvements on Aider-Polyglot (coding: 44.8% vs 25.6% for Dev2) and AlpacaEval (62.1% vs 55.8%), driven by the large-scale general-purpose SFT data.

Final RL (Stage 3) polishes instruction following. Final DeepSeek-R1 vs R1-Dev3: AlpacaEval improves by about 25 percentage points (87.6% vs 62.1%), ArenaHard by about 17 points (92.3% vs 75.6%). Math and code benchmarks see only marginal improvements since most reasoning-specific RL happened in earlier stages.

7.2 Comparison with Frontier Models

Figure 3: DeepSeek-R1 vs. frontier models on key benchmarks

Benchmark          DeepSeek-R1   OpenAI o1-1217  DeepSeek-V3   Claude 3.5
───────────────────────────────────────────────────────────────────────────
AIME 2024 (P@1)      79.8%           79.2%          39.2%          16.0%
MATH-500 (P@1)       97.3%           96.4%          90.2%          78.3%
LiveCodeBench        65.9%           63.4%          40.5%          36.3%
Codeforces Pctile    96.3%           96.6%          58.7%          20.3%
GPQA Diamond         71.5%           75.7%          59.1%          65.0%
MMLU                 90.8%           91.8%          88.5%          88.3%
IFEval               83.3%           92.8%          87.1%          88.0%
AlpacaEval 2.0       87.6%            —             70.0%          52.0%
───────────────────────────────────────────────────────────────────────────

DeepSeek-R1 essentially matches OpenAI o1 on math and code reasoning benchmarks. It falls short on instruction following (IFEval: 83.3% vs 92.8%), which the authors attribute to the cold start SFT being biased toward reasoning-style responses.

7.3 The “Aha Moment” Phenomenon

Figure 1(b) in the paper shows that average response length grows monotonically during R1-Zero training, tracking closely with the AIME performance curve. The model learns to think longer on harder problems — it allocates compute dynamically based on difficulty.

This is conceptually similar to test-time compute scaling (e.g., Best-of-N sampling, MCTS), but instead of external scaffolding, the model itself learns when to spend more thinking tokens. The key difference: this behavior is learned from reward signal alone, not engineered.

8. Limitations and Boundary Conditions

8.1 Limitations Acknowledged by the Authors

Language mixing. DeepSeek-R1 is optimized for Chinese and English. Queries in other languages may trigger English or Chinese reasoning internally. Root cause: DeepSeek-V3-Base is predominantly Chinese and English.

Structured output and tool use. R1’s outputs are verbose reasoning chains. For structured outputs (JSON, code with specific interfaces) or tool calls (search, calculator), R1 underperforms compared to instruction-following models. The authors note this is solvable with targeted RL but wasn’t done in this release.

Prompt sensitivity. DeepSeek-R1 performs best in a zero-shot setting. Adding few-shot examples consistently degrades performance — the model’s CoT reasoning style is disrupted by in-context examples. This is opposite to conventional LLMs where few-shot helps.

Token efficiency / overthinking. R1 sometimes generates unnecessarily long CoT on simple questions. The dynamic allocation is learned from training data distribution; if training has few simple questions, the model doesn’t learn to be concise on them.

Software engineering tasks. R1 shows only marginal improvement over DeepSeek-V3 on SWE-Bench (49.2% vs 48.8%). RL for software engineering requires running full test suites, which is slow and difficult to integrate into the RL training loop.

8.2 The Reward Hacking Limit

The most fundamental limitation of pure RL reasoning is reward hackability. Rule-based rewards (math answer matching, code test cases) are reliable because they’re deterministic. But for tasks that require more nuanced evaluation — writing quality, code elegance, factual accuracy in open-domain QA — you need a neural reward model, which can be exploited.

The paper documents this directly: when training with model-based rewards for too many steps, the policy learns to produce outputs that score highly on the reward model but actually perform worse on real metrics (Figure 6). Their mitigation (only 400 steps with model-based rewards) is a partial fix, but it means the second RL stage cannot do heavy optimization on general tasks.

9. Critical Assessment: Weaknesses & Improvements

W1: Benchmarking Validity — AIME Contamination Risk

The AIME 2024 benchmark is a set of 30 problems from a specific competition. The DeepSeek-V3-Base pre-training data includes math competitions up to some cutoff. The paper includes a data contamination check but acknowledges uncertainty. An AIME pass@1 of 79.8% is extraordinary — achieving this on genuinely held-out math olympiad problems, rather than problems the base model has seen variants of, is not fully demonstrated.

The paper does not compare against fresh post-training-cutoff competitions (e.g., AIME 2025, IMO 2025 shortlist problems). This is the most important missing experiment for evaluating true generalization of the reasoning capabilities.

W2: Cold Start Data Not Released — Reproducibility Gap

The cold start SFT dataset (thousands of human-rewritten CoT examples) is not publicly released. Without this data, the full DeepSeek-R1 pipeline cannot be reproduced. The paper is forthright that the multi-stage pipeline is “product-driven,” but a truly open recipe would require this data.

The 800K rejection-sampled SFT dataset used for the final model and for distillation is also not released; only model weights (DeepSeek-R1 and the distilled checkpoints) are. Researchers cannot independently study whether the distillation training data quality (rather than scale or base model capability) is the key driver.

W3: Reward Hacking Mitigation is Empirical, Not Principled

The paper documents reward hacking with model-based rewards and mitigates it by reducing the number of training steps. This is essentially “stop before it gets bad.” A more principled approach would be to use uncertainty quantification in the reward model, ensemble reward models with different inductive biases, or develop process-level reward models that evaluate reasoning quality step-by-step (process reward models, PRMs). The paper does not compare against these alternatives.

W4: Capability Gap on Non-Reasoning Tasks is Not Fully Analyzed

IFEval (instruction following) shows DeepSeek-R1 at 83.3% vs o1’s 92.8% — a meaningful gap of nearly 10 percentage points. SimpleQA (factual accuracy on atomic facts) shows R1 at 30.1% vs o1 at 47.0% — a larger gap. The paper attributes this to the reasoning-optimized training pipeline but doesn’t provide ablations showing which stage is responsible or propose a path to closing this gap.

L1: Overthinking on Simple Questions Is Understated

The paper mentions “overthinking” as a limitation but does not quantify it. A model that generates 10,000 tokens to answer “What is 2+2?” has real deployment costs (latency, API pricing, hardware utilization). The distilled small models inherit this behavior and may be particularly prone to it due to imitating R1’s long-CoT style on all queries. A length penalty calibrated to question difficulty (e.g., using question difficulty estimation) was not explored.

L2: Multi-Turn Reasoning Not Addressed

Table 5 shows that 99.7% of the SFT data is single-turn (Avg Rounds ≈ 1.0). Real-world reasoning tasks often require multi-turn interaction — asking clarifying questions, receiving partial information, iterating. DeepSeek-R1’s CoT is impressive for single-shot problems but may degrade significantly on tasks requiring conversational context.

I1: Process Reward Models for Verification Quality

The self-reflection (“Wait, wait. Wait.”) behavior emerges spontaneously, but it’s not clear when self-reflection is productive vs. a hallucinated correction of a correct earlier step. Process Reward Models (PRMs) that score individual reasoning steps could replace the current approach: train the RL policy to not just produce correct final answers, but correct intermediate steps. This would make the verification more principled and might reduce overthinking.

I2: Decoupling Language Consistency from Reasoning Quality

The language consistency reward (Eq. 5) penalizes all cross-language words uniformly. But code, mathematical notation, and technical terms are language-agnostic — penalizing a Chinese model for writing def compute_loss(...) in a Python solution is counterproductive. A more nuanced LC reward could be token-type-aware, only penalizing cross-language natural language words while ignoring code, LaTeX, and standard technical vocabulary.
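A token-type-aware score of the kind proposed above could be sketched as follows. This is only an illustration under stated assumptions: the masking patterns, the CJK character range, the word-counting rule, and the names `strip_language_agnostic` and `lc_score` are all hypothetical and are not taken from the paper's Eq. 5.

```python
import re

FENCE = "`" * 3  # built programmatically so this example nests cleanly

# Spans treated as language-agnostic (illustrative assumption).
MASKS = [
    re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL),  # fenced code
    re.compile(r"`[^`\n]*`"),    # inline code
    re.compile(r"\$[^$\n]*\$"),  # inline LaTeX
]
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
LATIN_WORD = re.compile(r"[A-Za-z]+")


def strip_language_agnostic(text: str) -> str:
    """Remove code and math spans before measuring language consistency."""
    for pattern in MASKS:
        text = pattern.sub(" ", text)
    return text


def lc_score(text: str, target: str = "zh") -> float:
    """Fraction of natural-language units that are in the target language.

    Units are CJK characters and Latin-alphabet words (a crude proxy).
    """
    prose = strip_language_agnostic(text)
    n_cjk = len(CJK_CHAR.findall(prose))
    n_latin = len(LATIN_WORD.findall(prose))
    total = n_cjk + n_latin
    if total == 0:
        return 1.0  # nothing to penalize
    in_target = n_cjk if target == "zh" else n_latin
    return in_target / total


mixed_code = "我们定义损失函数 `def compute_loss(x)` 然后求和 $x^2$"
mixed_prose = "我们 use the chain rule"
print(lc_score(mixed_code), lc_score(mixed_prose))
```

Under this sketch the Chinese response that embeds a Python signature is not penalized, while genuinely mixed prose still is.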

I3: Systematic Evaluation of Distillation Recipes

The paper shows that SFT on 800K R1-generated samples is sufficient to distill reasoning into 1.5B–70B models. But is 800K optimal? Is the data mix optimal? Would mixing R1-generated data with base SFT data improve general capability at modest reasoning cost? These scaling law experiments for distillation would be highly valuable for the community and are missing from the paper.

10. Conclusion

DeepSeek-R1 is a landmark result that reshapes our understanding of how reasoning emerges in LLMs. The core finding — that self-reflection, verification, and dynamic compute allocation can arise from outcome-based RL alone — fundamentally challenges the assumption that human-labeled reasoning trajectories are necessary.

The GRPO algorithm makes this tractable by eliminating the value model (halving the memory and compute overhead of RL) and replacing per-token KL penalties with a sequence-level formulation that doesn’t penalize response length. The four-stage pipeline (cold start → RL → rejection sampling SFT → RL) elegantly separates reasoning capability acquisition (Stages 1-2) from usability alignment (Stages 3-4).

The distillation results are perhaps equally important: the fact that a 1.5B model fine-tuned on R1’s reasoning traces achieves frontier-level math performance suggests that, for this class of tasks, the bottleneck is training data quality and format rather than model capacity alone.

The open questions are significant: Can this approach extend to tasks without reliable verifiers (writing, factual QA, complex planning)? Can the overthinking behavior be tamed without sacrificing reasoning quality? Can the pipeline be made fully reproducible? These questions will drive the next chapter of reasoning model research.

For practitioners, the immediate takeaway is actionable: if you have a verifiable task, a strong base model, and compute to burn, pure RL may outperform carefully curated SFT for reasoning improvement. The age of needing to explain every reasoning step to the model may be over.

Reproducibility Notes

Model weights: DeepSeek-R1 series (all 6 distilled models + R1 + R1-Zero) are available at https://huggingface.co/deepseek-ai
Base model: DeepSeek-V3-Base (671B, 37B active, MoE). Pre-training data: 14.8T tokens, predominantly Chinese and English web + e-books.
GRPO implementation: Open-source implementations exist in OpenRLHF, verl, TRL. Key hyperparams: G=16, ε=10, β=0.001, lr=3e-6, max output length 32K→64K.
Data: The 800K rejection-sampled SFT dataset is not publicly released. Community approximations exist (e.g., OpenThoughts, Sky-T1 datasets).
Compute: R1-Zero training: 64×8 H800 GPUs for about 198 hours of wall-clock time, i.e. 512 × 198 ≈ 101K GPU-hours (~200K USD). Full R1 including all stages: estimated >400K USD.
Reproducibility gap: Without the cold start data and the exact reward model details, full reproduction of the DeepSeek-R1 pipeline is not possible from the paper alone. Partial reproduction (R1-Zero style training) is well-documented and has been successfully replicated by the community (e.g., Sky-T1, OpenR1, STILL-3).

Appendix E: Detailed Data Recipe and Hyperparameter Summary

For completeness, here is a consolidated reference of the key hyperparameters across all training stages:

Stage 0 — Cold Start SFT:

Parameter	Value
Base model	DeepSeek-V3-Base (671B, 37B active, MoE)
Dataset size	~thousands of curated examples
Learning rate	cosine decay, 5×10⁻⁵ → 5×10⁻⁶
Max sequence length	32,768 tokens
Batch size	128
Epochs	2-3

Stage 1 — First RL (GRPO on reasoning prompts):

Parameter	Value
Algorithm	GRPO
Learning rate	3e-6
KL coefficient β	0.001
Clip ratio ε	10 (unusually large!)
Group size G	16
Sampling temperature	1.0
Max output length	32,768 → 65,536 tokens (at step 8.2K)
Batch size	32 questions × 16 outputs = 512 per step
Reference model refresh	Every 400 steps
Total steps	~10,400 (R1-Zero); similar for R1 Stage 1

Stage 2 — Rejection Sampling + SFT:

Parameter	Value
Dataset size	~800K samples
Average tokens per sample	~5,355
Learning rate	cosine decay, 5×10⁻⁵ → 5×10⁻⁶
Max sequence length	32,768 tokens
Batch size	128
Epochs	2-3

Stage 3 — Second RL (GRPO on diverse prompts):

Parameter	Value
Algorithm	GRPO
Learning rate	3e-6
KL coefficient β	0.001
Sampling temperature	0.7 (reduced from Stage 1!)
Total steps	1,700
Model-based reward intro	Final 400 steps only

Distillation:

Parameter	Value
Dataset	800K R1-generated rejection-sampled SFT data
Learning rate	model-dependent (1×10⁻⁴ for 1.5B, 2×10⁻⁵ for 70B)
LR schedule	cosine decay to 1/10 of initial
Max sequence length	32,768 tokens
Batch size	64
Epochs	2-3
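The "cosine decay to 1/10 of initial" schedule in the distillation table can be written out as a short function. This is a minimal sketch, assuming no warmup and a step-based schedule; the function name `cosine_lr` and the `final_ratio` parameter are illustrative and do not come from the paper.

```python
import math


def cosine_lr(step: int, total_steps: int, lr_init: float,
              final_ratio: float = 0.1) -> float:
    """Cosine decay from lr_init down to final_ratio * lr_init.

    Assumes no warmup phase (an assumption, not stated in the source).
    """
    lr_min = lr_init * final_ratio
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))


# Example with the 1.5B setting from the table (1e-4 initial learning rate).
for step in (0, 500, 1000):
    print(step, cosine_lr(step, total_steps=1000, lr_init=1e-4))
```

The schedule starts at the initial rate, passes through the midpoint of the two rates halfway through training, and ends at one tenth of the initial value.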

The large clip ratio $\varepsilon = 10$ in GRPO deserves special attention. In standard PPO implementations, $\varepsilon \approx 0.1$ to $0.2$. A small ε clips the policy ratio aggressively: once a token's ratio leaves $[1-\varepsilon, 1+\varepsilon]$ in the direction its advantage favors, that token's gradient is zeroed. This prevents large parameter updates, but for a response of 10,000 tokens it can mean that a large share of tokens contribute no gradient, so the policy cannot learn effectively from long responses. The DeepSeek team found that ε = 10 is necessary for reasoning training with long outputs, essentially removing the clipping constraint for practical purposes while relying on the KL term for stability.
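The effect of ε on which tokens still receive gradient can be checked with a small sketch of the clipped surrogate from Eq. A1. The per-token ratios below are made-up values for illustration, and the helper names are hypothetical.

```python
def grad_wrt_ratio(ratio: float, adv: float, eps: float) -> float:
    """Derivative w.r.t. ratio of min(ratio*adv, clip(ratio, 1-eps, 1+eps)*adv)."""
    if adv > 0 and ratio > 1.0 + eps:
        return 0.0  # clipped from above: the token stops contributing
    if adv < 0 and ratio < 1.0 - eps:
        return 0.0  # clipped from below: the token stops contributing
    return adv


def count_silenced(ratios, adv: float, eps: float) -> int:
    """Number of tokens whose gradient is zeroed by clipping."""
    return sum(1 for r in ratios if grad_wrt_ratio(r, adv, eps) == 0.0)


ratios = [0.7, 0.95, 1.05, 1.3, 2.5]  # hypothetical per-token importance ratios

for eps in (0.2, 10.0):
    print(eps, count_silenced(ratios, +1.0, eps), count_silenced(ratios, -1.0, eps))
```

Note that for any ε ≥ 1 the lower bound 1 − ε is non-positive, so it can never bind on a ratio of probabilities; with ε = 10 only ratios above 11 are clipped.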

Appendix A: GRPO vs PPO — Deeper Mathematical Comparison

This appendix walks through the key mathematical differences between GRPO and PPO in more detail, since they explain why GRPO enables long CoT training.

A.1 PPO Objective and GAE

In standard PPO, the policy gradient objective is:

\mathcal{J}_\text{PPO}(\theta) = \mathbb{E}_{t} \left[ \min\left( r_t(\theta) \hat{A}_t,\; \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon) \hat{A}_t \right) - \beta_\text{PPO} \cdot \text{KL}_t \right] \tag{A1}

where $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_\text{old}}(a_t|s_t)$ is the importance sampling ratio at token position $t$ , and $\text{KL}_t$ is the per-token KL penalty added as a dense reward.

The advantage $\hat{A}_t$ is computed via GAE:

\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \tag{A2}

where $V(s)$ is the value function (the Critic), $\gamma$ is the discount factor, $\lambda \in [0,1]$ is the GAE trace-decay parameter, and $\delta_t$ is the one-step TD error. Note that $r_t$ in $\delta_t$ denotes the per-token reward, not the importance ratio $r_t(\theta)$ of Eq. A1.

The sensitivity to $\lambda$ is significant: the paper shows experimentally that with $\lambda = 0.95$ (the default in most PPO implementations), PPO performs considerably worse than GRPO. Only with careful tuning to $\lambda = 1.0$ does PPO match GRPO’s performance. At $\lambda = 1$, however, the advantage reduces to the Monte-Carlo return minus the value baseline $V(s_t)$, so the bias-variance tradeoff that GAE was designed to provide is gone.
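The λ = 1 special case of Eq. A2 can be verified numerically. The sketch below is a plain-Python GAE under the assumption of a single finished episode with a terminal value of 0; the reward and value numbers are invented for illustration.

```python
def gae(rewards, values, gamma: float, lam: float):
    """Generalized Advantage Estimation.

    `values` has len(rewards) + 1 entries; the last one is V at the
    terminal state (0.0 for a finished episode).
    """
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def discounted_returns(rewards, gamma: float):
    """Monte-Carlo return G_t for every position t."""
    out, g = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out


# Sparse outcome reward: only the final token is rewarded (toy numbers).
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.3, 0.5, 0.2, 0.9, 0.0]

adv_mc = gae(rewards, values, gamma=1.0, lam=1.0)
returns = discounted_returns(rewards, gamma=1.0)
print(adv_mc)
print([g - v for g, v in zip(returns, values)])
```

Both printed lists agree: with λ = 1 the TD errors telescope, leaving the return minus the Critic's baseline at each position.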

A.2 GRPO: Value-Free Advantage Estimation

GRPO replaces the entire value model with group-relative normalization:

A_i = \frac{r_i - \mu_r}{\sigma_r}, \quad \mu_r = \frac{1}{G}\sum_{j=1}^G r_j, \quad \sigma_r = \sqrt{\frac{1}{G}\sum_{j=1}^G (r_j - \mu_r)^2} \tag{A3}

This has several nice properties:

Scale invariance: The normalized advantage is insensitive to the absolute scale of rewards.
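Group competition: Within a group with mixed outcomes, outputs that score above the group mean get positive advantage and those below get negative advantage. If every output earns the same reward (for example, all $r_i = 1$), every numerator $r_i - \mu_r$ is 0, so all advantages are treated as 0 and the prompt contributes no gradient; since $\sigma_r = 0$ makes Eq. A3 formally 0/0, implementations typically add a small constant to the denominator. The policy doesn’t waste gradient on problems it has already solved.
Memory efficiency: No value model and no optimizer states for the Critic, which roughly halves the number of trained parameters when the Critic would be the same size as the policy.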
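The properties listed above follow directly from Eq. A3 and can be checked with a few lines of code. This is a minimal sketch; the small constant `eps` in the denominator is an assumption added to avoid division by zero when all rewards in a group are equal, and is not part of Eq. A3 itself.

```python
import math


def group_advantages(rewards, eps: float = 1e-6):
    """Group-relative advantages: (r_i - mean) / (std + eps).

    Uses the population standard deviation (1/G), as in Eq. A3.
    `eps` is a numerical-stability assumption, not from the paper.
    """
    g = len(rewards)
    mu = sum(rewards) / g
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / g)
    return [(r - mu) / (sigma + eps) for r in rewards]


print(group_advantages([1.0, 0.0, 0.0, 1.0]))      # mixed outcomes
print(group_advantages([10.0, 0.0, 0.0, 10.0]))    # same outcomes, rewards scaled by 10
print(group_advantages([1.0, 1.0, 1.0, 1.0]))      # all correct: no learning signal
```

The first two groups produce nearly identical advantages (scale invariance), and the all-correct group produces zeros.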

A.3 The KL Divergence Formulation Difference

PPO adds KL as a per-token penalty:

\text{PPO reward at token } t: \quad r_t + \left(-\beta_\text{PPO} \cdot \log \frac{\pi_\theta(a_t|s_t)}{\pi_\text{ref}(a_t|s_t)}\right) \tag{A4}

Since RL maximizes cumulative reward $\sum_t r_t$ , and each token contributes $-\beta_\text{PPO} \cdot \text{KL}_t$ to the cumulative reward, a response of length $L$ accumulates total KL cost:

\text{Total KL cost (PPO)} = \beta_\text{PPO} \cdot \sum_{t=1}^{L} \text{KL}_t \approx \beta_\text{PPO} \cdot L \cdot \overline{\text{KL}} \tag{A5}

This grows linearly with response length $L$ . Longer responses have higher KL cost, which PPO implicitly penalizes.

GRPO’s KL is computed at the sequence level using an unbiased estimator of the reverse KL:

D_\text{KL}(\pi_\theta \| \pi_\text{ref}) \approx \frac{\pi_\text{ref}(o_i|q)}{\pi_\theta(o_i|q)} - \log \frac{\pi_\text{ref}(o_i|q)}{\pi_\theta(o_i|q)} - 1 \tag{A6}

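This expression is the per-sample estimator $k(o) = \pi_\text{ref}/\pi_\theta - \log(\pi_\text{ref}/\pi_\theta) - 1$. It is unbiased under samples from the current policy: because $\mathbb{E}_{\pi_\theta}[\pi_\text{ref}/\pi_\theta] = 1$, we get $\mathbb{E}_{\pi_\theta}[k] = \mathbb{E}_{\pi_\theta}[\log(\pi_\theta/\pi_\text{ref})] = D_\text{KL}(\pi_\theta \| \pi_\text{ref})$, and $k \ge 0$ for every sample. The term is added once per sequence to the GRPO objective. It does not scale with sequence length, allowing the model to freely generate long reasoning chains without accumulating additional regularization cost.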
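The unbiasedness of the estimator in Eq. A6 can be confirmed exactly on a small discrete example. The two distributions below are arbitrary toy values standing in for $\pi_\theta$ and $\pi_\text{ref}$ over three possible outputs; the function names are illustrative.

```python
import math


def exact_kl(p, q) -> float:
    """D_KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def k3(p_x: float, q_x: float) -> float:
    """Per-sample estimator from Eq. A6: q/p - log(q/p) - 1, for x sampled from p."""
    ratio = q_x / p_x
    return ratio - math.log(ratio) - 1.0


def expected_k3(p, q) -> float:
    """Exact expectation of k3 under p (enumerates all outcomes)."""
    return sum(pi * k3(pi, qi) for pi, qi in zip(p, q))


policy = [0.5, 0.3, 0.2]      # toy stand-in for pi_theta
reference = [0.4, 0.4, 0.2]   # toy stand-in for pi_ref

print(exact_kl(policy, reference), expected_k3(policy, reference))
print([k3(p, q) for p, q in zip(policy, reference)])
```

The two expectations match, and every per-sample value is non-negative, unlike the naive estimator $\log(\pi_\theta/\pi_\text{ref})$, which is negative for any output where the reference assigns higher probability.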

A.4 Periodically Refreshing the Reference Policy

A subtlety in DeepSeek-R1’s GRPO training: the reference policy $\pi_\text{ref}$ is updated every 400 steps to match the current policy $\pi_\theta$ . This has a crucial effect: rather than enforcing that the trained policy stays close to the original pre-trained model forever, it only requires that the policy doesn’t change too drastically in any single 400-step window.

Over thousands of training steps, the policy can drift far from the initial checkpoint — exactly what you want for reasoning tasks where the pre-trained policy has poor AIME performance. Keeping a static reference would gradually strangle the RL signal as the policy diverges.
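The difference between a refreshed and a static reference can be illustrated with a deliberately crude toy model. It assumes the policy is a single integer that drifts by one unit per step, which is not how real training behaves; only the refresh bookkeeping is the point, and `simulate_drift` is a hypothetical name.

```python
def simulate_drift(total_steps: int, interval: int, refresh: bool = True):
    """Return (max gap to the reference, final gap to the initial policy).

    Toy assumption: the policy 'parameter' moves by exactly one unit per step.
    """
    theta, ref, init = 0, 0, 0
    max_gap_to_ref = 0
    for step in range(1, total_steps + 1):
        theta += 1  # one unit of policy drift per optimizer step
        max_gap_to_ref = max(max_gap_to_ref, abs(theta - ref))
        if refresh and step % interval == 0:
            ref = theta  # reference policy is reset to the current policy
    return max_gap_to_ref, abs(theta - init)


print(simulate_drift(2000, 400, refresh=True))
print(simulate_drift(2000, 400, refresh=False))
```

With refreshing, the gap that the KL term sees never exceeds one window of drift, while the total distance from the initial checkpoint keeps growing; with a static reference the two are the same quantity.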

Appendix B: Cold Start Data Pipeline — Step-by-Step

The cold start SFT data is arguably the most overlooked component of the DeepSeek-R1 pipeline, yet it’s what transforms R1-Zero’s powerful but unreadable reasoning into the polished R1 style. Here’s the detailed pipeline:

Step 1: Seed question collection. Gather thousands of high-quality, diverse reasoning prompts from math competitions (AMC, AIME, AMO), programming contests (Codeforces, AtCoder), and STEM problems.

Step 2: High-temperature rollout from R1-Zero. For each prompt, generate 10-20 reasoning trajectories at temperature 1.0. High temperature encourages diverse reasoning strategies, increasing the chance that at least one trajectory discovers the correct approach.

Step 3: Filtering by correctness and format. Keep only trajectories where:

The final answer matches the ground truth (verified by symbolic math parser or code execution)
The response doesn’t contain excessive repetition (repetition detection filter)
The response is not language-mixed beyond a threshold (LC filter)
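The three filters in Step 3 can be sketched as a single predicate. Everything specific here is an assumption for illustration: the exact-match answer check stands in for a symbolic math parser or code execution, the n-gram repetition heuristic and its thresholds are invented, and the language-consistency scorer is passed in as a callable.

```python
from collections import Counter


def has_excessive_repetition(text: str, n: int = 4, max_share: float = 0.2) -> bool:
    """Flag text whose most frequent word n-gram dominates (hypothetical heuristic)."""
    words = text.split()
    if len(words) < n:
        return False
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    top_count = grams.most_common(1)[0][1]
    return top_count / sum(grams.values()) > max_share


def keep_trajectory(text: str, answer: str, ground_truth: str,
                    lc_score, min_lc: float = 0.9) -> bool:
    """Apply the correctness, repetition, and language-mixing filters in order."""
    if answer.strip() != ground_truth.strip():  # stand-in for a real verifier
        return False
    if has_excessive_repetition(text):
        return False
    if lc_score(text) < min_lc:  # lc_score: callable returning a value in [0, 1]
        return False
    return True


good = "First compute the sum then verify the result by substitution"
loopy = "wait let me check " * 10

print(keep_trajectory(good, "42", "42", lambda t: 1.0))
print(keep_trajectory(loopy, "42", "42", lambda t: 1.0))
```

Ordering the checks from cheapest-to-reject to most expensive is a design choice of this sketch, not something the paper specifies.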

Step 4: Human rewriting for style. The accepted trajectories are given to human annotators who rewrite them to:

Use first-person perspective (“I notice that…”, “Let me reconsider…”)
Present reasoning in clear, coherent paragraphs
Format math in LaTeX, code in fenced blocks
Ensure language consistency throughout

Step 5: LLM-based generation expansion. The human-rewritten examples are used as prompts to a capable LLM (DeepSeek-V3), which generates additional examples in the same style. This multiplies the cold start data from hundreds to thousands.

Step 6: Human verification of LLM-generated examples. A second round of human QA checks the LLM-generated data for correctness, style consistency, and naturalness.

The result is a training set where every example shows: (a) a hard problem, (b) a long, self-reflective, first-person reasoning process, (c) explicit verification steps, and (d) a correct, boxed final answer. This is the “cold start” that teaches DeepSeek-R1 the fundamental communication contract with users.

Appendix C: Test-Time Compute Scaling in Reasoning Models

DeepSeek-R1’s dynamic thinking-time allocation connects to a broader research thread on test-time compute scaling. Understanding this connection is valuable for predicting where this line of research goes next.

The compute continuum:

Static inference                           Dynamic inference
─────────────────────────────────────────────────────────────►
  Single forward pass    →   Best-of-N   →   MCTS   →   R1-style CoT
  (greedy decode)            (parallel)    (sequential,    (sequential,
                                           tree search)    learned)
  Compute: 1x             Compute: Nx    Compute: O(D·B)  Compute: learned
  Overhead: 0             Overhead: N runs  Overhead: D depth  Overhead: training

Why R1’s approach is compelling from a compute perspective:

No orchestration overhead: Best-of-N requires N complete generations plus a selection step. MCTS requires building and navigating a tree. R1’s CoT happens in a single autoregressive generation: the “thinking” is just generating more tokens.
Adaptive allocation: R1 generates more tokens for harder questions automatically (as shown by the training curves). A simple question might get a 500-token response; an Olympiad problem might get a 10,000-token response. Best-of-N uses the same compute for all inputs.
The scaling law: As shown in Figure 1(b), average response length grows roughly linearly with training steps. This suggests that longer training directly translates to more thinking capacity — a remarkable property that implies compute invested in RL training pays dividends in inference quality.

The open question: At what problem difficulty level does R1-style CoT hit a ceiling that MCTS (with its principled tree search) would not? The paper doesn’t address this directly, but it’s arguably the most important question for the next generation of reasoning models.

Appendix D: Why This Changes the RL-for-LLM Calculus

Before DeepSeek-R1, the dominant narrative about RL for LLMs was:

RL is only useful after a well-initialized SFT model (RLHF paradigm)
Human preference data is essential for RL to work
The reward signal defines the upper bound of model capability

DeepSeek-R1-Zero disproves all three:

SFT is not necessary — RL alone can bootstrap reasoning from a base model
Human feedback (for the reasoning process) is replaceable by rule-based verifiers
The capability upper bound is set by the base model’s latent potential, which RL unlocks

The key insight is that DeepSeek-V3-Base already “knows” how to do math — it was pre-trained on 14.8T tokens including substantial math and code content. What RL does is find the elicitation protocol: the response format and strategy that allows the model’s latent knowledge to be expressed correctly. SFT with human reasoning traces gives one such protocol, but it’s constrained by human thought patterns. RL finds potentially better protocols autonomously.

This reframes reasoning model development: the bottleneck is no longer “how do I get humans to write correct reasoning traces?” but “do I have a base model with sufficient latent knowledge, and a reliable verifier for my task?” If yes, RL might be the better path.