June 9, 2026 #Reinforcement Learning #LLM Training #Reasoning

VAPO: Value-Augmented Proximal Policy Optimization for Long-CoT Reasoning

Review date: 2026-06-09
Review author: Zhongzhu Zhou
Paper reviewed: VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
Paper authors: Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, et al. (ByteDance Seed)
arXiv: 2504.05118
Status / Venue: arXiv preprint (April 11, 2025)

Short Answer

VAPO argues that value-model-based RL has a higher performance ceiling than value-model-free methods (GRPO, DAPO) for long-chain-of-thought reasoning, but only if three hard problems are solved: eliminating value-model bias from long bootstrapped trajectories, handling the wildly different bias-variance requirements of short vs. long responses, and overcoming sparse terminal rewards. VAPO solves all three with a set of carefully integrated techniques — the centrepiece being Length-adaptive GAE — and reaches 60.4 on AIME 2024 with Qwen2.5-32B, surpassing DAPO by 10+ points while being more stable and sample-efficient.

Prerequisites: What You Need to Know First

Before diving into VAPO’s mechanics, let me walk through the background concepts that the paper builds on. If you already know PPO and GRPO deeply, feel free to skim this section; but if RL for LLMs is new to you, read carefully — everything in VAPO makes more sense once these building blocks are clear.

The Reinforcement Learning Setup for LLMs

Training an LLM with RL is framed as a token-level Markov Decision Process (MDP). Here is what that means concretely:

State $s_t$ : the entire token sequence seen so far — prompt tokens $x_0, \dots, x_m$ plus generated tokens $y_0, \dots, y_t$ .
Action $a_t$ : the next token chosen from the vocabulary $\mathcal{A}$ .
Transition: deterministic — once you pick $a_t = y_{t+1}$ , the next state is just $s_t$ with that token appended.
Reward $R(s_t, a_t)$ : a scalar signal. In verifiable tasks (math, code), reward is sparse — typically 0 at every step and either +1 (correct) or 0 (wrong) only when the <eos> token is emitted.
Episode horizon $H$ : the total number of tokens in a response. For long-CoT reasoning this can be thousands of tokens, making episodes far longer than traditional RL environments.

The agent (the LLM policy $\pi_\theta$ ) must learn to maximize expected total reward while staying close to an initial reference policy $\pi_\text{ref}$ to prevent degenerate outputs. This is expressed as the KL-regularised objective:

\pi^* = \arg\max_{\pi} \, \mathbb{E}_{\pi, s_0 \sim d_0}\!\left[\sum_{t=0}^{H}\bigl(R(s_t, a_t) - \beta \, \mathrm{KL}(\pi(\cdot|s_t) \,\|\, \pi_\text{ref}(\cdot|s_t))\bigr)\right] \tag{1}

The $\beta$ coefficient tunes how aggressively the policy is allowed to stray from the reference. Small $\beta$ gives the model more freedom to explore; large $\beta$ keeps outputs closer to the supervised fine-tuned starting point.

Proximal Policy Optimization (PPO)

PPO is the dominant value-model-based RL algorithm for LLMs. Its core idea is: compute a per-token advantage estimate $\hat{A}_t$ (how much better is token $a_t$ compared to the “average” action from state $s_t$ ?), then update the policy to increase probability of high-advantage tokens — but clip the update ratio so a single gradient step can’t cause catastrophic policy collapse.

The clipped surrogate objective, which PPO maximises, is:

\mathcal{L}^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right] \tag{2}

where $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_\text{old}}(a_t|s_t)$ is the probability ratio of the new vs. old policy, and $\epsilon$ (often 0.2) is the clip range.

Intuitively: if the advantage is positive (this was a good token), PPO encourages the policy to increase its probability — but caps how much probability is shifted in one step. If the advantage is negative (bad token), the policy is pushed to decrease probability, again with a cap. This symmetry ensures learning is gradual.
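The clipping behaviour of Eq. 2 is easiest to see on individual tokens. Below is a minimal pure-Python sketch, assuming per-token log-probabilities and advantages are already available as plain lists; the function name `clipped_surrogate` is illustrative and not taken from any library or from the paper.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean PPO clipped surrogate objective (Eq. 2) over a list of tokens.

    This is the quantity to maximise; a training loop would minimise its negative.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                  # r_t(theta)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))    # clip(r_t, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)           # pessimistic of the two
    return total / len(advantages)

# A good token whose probability already rose by 50%: credit is capped at 1 + eps.
print(clipped_surrogate([math.log(1.5)], [0.0], [1.0]))
# A bad token whose probability already halved: the term is capped at -(1 - eps).
print(clipped_surrogate([math.log(0.5)], [0.0], [-1.0]))
```

Once the ratio leaves the clip range in the direction the advantage favours, the term becomes constant in $\theta$ and contributes no further gradient, which is what keeps each update small.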

Generalized Advantage Estimation (GAE)

The advantage $\hat{A}_t$ tells the policy whether action $a_t$ was good or bad relative to the value function estimate $V(s_t)$ . Computing it requires two ingredients:

TD residual: $\delta_t = R(s_t, a_t) + \gamma V(s_{t+1}) - V(s_t)$ , the immediate Bellman error.
GAE with parameter $\lambda$ :

\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l \, \delta_{t+l} \tag{3}

Here $T$ is the number of tokens in the response (the horizon $H$ above), $\gamma$ is the discount factor, and $\lambda \in [0,1]$ controls the bias-variance tradeoff:

$\lambda = 0$ : one-step TD. Low variance but high bias (relies entirely on the value model, which may be wrong).
$\lambda = 1$ : full Monte Carlo return. No bias from the value model, but high variance (a single unlucky trajectory can dominate).
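The sum in Eq. 3 is usually evaluated with a single backward pass over the trajectory. The sketch below is one common way to do that, assuming the value after the final token is zero (the episode ends at `<eos>`); the function name `gae` and its argument layout are illustrative.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """GAE advantages (Eq. 3) for one trajectory.

    rewards: per-token rewards r_0 .. r_{T-1}
    values:  value estimates V(s_0) .. V(s_{T-1})
    Assumption: the value after the last token is 0 (terminal state).
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages

# Sparse terminal reward with a flat value estimate of 0.5 at every token.
rewards, values = [0.0, 0.0, 1.0], [0.5, 0.5, 0.5]
print(gae(rewards, values, lam=0.0))  # only the last token sees the reward
print(gae(rewards, values, lam=1.0))  # every token gets return minus value
```

With $\lambda = 0$ the reward information stays at the final token unless the value model carries it backwards; with $\lambda = 1$ every token receives the Monte Carlo return minus its own value estimate.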

The value model $V_\phi(s_t)$ must be trained to predict the expected future return from state $s_t$ . This adds computational overhead — you need to store and update a second neural network alongside the policy.

Why GRPO Avoids the Value Model (And Why That’s Limiting)

GRPO (Group Relative Policy Optimization), introduced in the DeepSeekMath paper, sidesteps the value model entirely. For each prompt $x$ , GRPO samples $G$ responses $\{y_1, \dots, y_G\}$ and computes rewards $\{r_1, \dots, r_G\}$ by running a verifier. The group-normalised reward is used directly as the advantage:

\hat{A}_i = \frac{r_i - \mu_G}{\sigma_G} \tag{4}

where $\mu_G$ and $\sigma_G$ are the mean and standard deviation of rewards within the group. This advantage is trajectory-level — the same value is assigned to every token in response $y_i$ .
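Eq. 4 is short enough to write out directly. This sketch assumes the population standard deviation and adds a small `eps` to avoid division by zero; both choices are assumptions made here, not details confirmed by the source.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalised advantages (Eq. 4) for the G responses to one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)          # assumption: population std
    return [(r - mu) / (sigma + eps) for r in rewards]

print(group_advantages([1, 0, 0, 1]))  # mixed outcomes: roughly +1 / -1
print(group_advantages([0, 0, 0, 0]))  # all wrong: every advantage is 0
```

The second call shows a practical consequence: when every response in a group receives the same reward, all advantages are zero and that prompt contributes no gradient.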

GRPO’s appeal is simplicity: no value network, no bootstrapping instability, no TD residual. The group mean serves as a clean baseline. DAPO extended GRPO with clip-higher and token-level loss, pushing AIME 2024 scores to ~50 with Qwen2.5-32B.

The fundamental limitation: GRPO provides coarse credit assignment. Every token in the same response gets the same advantage, regardless of whether a reasoning step at token 150 was brilliant or whether the error crept in at token 800. If you can train a good value model, you can do much better by assigning advantages at token granularity — but that “if” is the crux of VAPO’s challenge.

Self-Imitation Learning (SIL)

Self-Imitation Learning (Oh et al., 2018) is an off-policy RL technique that stores historically successful trajectories in a replay buffer and periodically replays them alongside on-policy rollouts. The intuition: if the model once succeeded at a hard problem, it should re-internalise that solution as a positive anchor, not let it decay. In LLM training this means: keep a pool of high-reward responses, and periodically mix them into the training batch.

What the Paper Does

VAPO makes one central claim: value-model-based RL can and should beat value-model-free methods, provided you solve three specific engineering challenges. Rather than proposing a single radical new algorithm, VAPO is an integrated system that carefully identifies why vanilla PPO fails on long-CoT tasks and fixes each failure mode:

Challenge	Symptom in PPO	VAPO’s Fix
Value model bias	Value network diverges / underfits → wrong advantages	Value Pretraining + Decoupled-GAE
Heterogeneous sequence lengths	Short responses over-regularised; long ones under-regularised	Length-adaptive GAE (new, from VAPO)
Sparse rewards	Almost no learning signal until `<eos>`	Clip-Higher + SIL + Group-Sampling

The resulting system achieves 60.4 on AIME 2024 with Qwen2.5-32B — reaching this score within 5,000 gradient steps and with zero training crashes across multiple independent runs, compared to DAPO’s ~50 and vanilla PPO’s ~5.

flowchart TD
    A["Pre-trained LLM\n(Qwen2.5-32B)"] --> B["Value Pretraining\n(warm up value model\non SFT rollouts)"]
    B --> C["VAPO RL Loop"]
    C --> D["Group-Sampling\n(G responses/prompt)"]
    D --> E["Verifier Reward\n(binary correct/wrong)"]
    E --> F["Decoupled-GAE\nAdvantage Estimation\n(separate V-update)"]
    F --> G["Length-adaptive GAE\n(λ adapts to seq length)"]
    G --> H["Clip-Higher Loss\n(asymmetric ε clipping)"]
    H --> I["Token-level Policy Update"]
    I --> J{{"Self-Imitation Learning\n(replay top-k buffer)"}}
    J --> C
    C --> K["Policy Model\n(updated every step)"]
    K --> L["Evaluation\nAIME 2024 / AMC / MATH"]

Figure 1: VAPO System Architecture — Data flows clockwise from the pre-trained base model through value pretraining, into the online RL loop with group-sampling, verifier scoring, advantage estimation (Decoupled + Length-adaptive GAE), and policy update via clip-higher token-level loss. A self-imitation replay buffer injects high-reward past trajectories as additional training signal.

Challenge 1: Value Model Bias

The Problem in Depth

In standard PPO for LLMs, the value network is initialised from the same pre-trained weights as the policy and is updated end-to-end during RL. This causes two compounding problems in the long-CoT setting:

Bootstrapping instability: GAE computes advantages recursively using the value model’s current estimates. If $V_\phi$ is inaccurate early in training (which it inevitably is, since the model has never seen RL rollouts before), the TD residuals $\delta_t$ are systematically wrong. Since each $\delta_t$ enters the advantage sum at every position $\le t$ , errors compound: a bad value estimate at token 1000 contaminates the advantages of all earlier tokens as well.
Cold-start bias: The policy and value model share parameters. When RL begins, the value head is essentially random — it has been trained only on next-token prediction, not on predicting cumulative reward. Early gradient updates are therefore based on garbage advantage estimates, and the policy can be pushed in harmful directions before the value model stabilises.

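To see why this matters numerically: suppose the true value at a state $s_t$ is 0.6 (meaning 60% of trajectories that pass through it end in a correct answer), but the value model estimates 0.1. The TD residual $\delta_t$ subtracts $V_\phi(s_t)$ , so the advantage of the token chosen at $s_t$ is inflated by 0.5 (the outcome looks much better than expected). The same estimate enters $\delta_{t-1}$ with a positive sign, so for $\lambda < 1$ the advantage of a token $k$ steps earlier is pushed down by $(\gamma\lambda)^{k-1}\gamma(1-\lambda) \cdot 0.5$ . A single bad estimate therefore distorts the update signal over a whole stretch of the response, in opposite directions on either side of the error, which leads to misdirected policy updates and training instability.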

VAPO’s Fix: Value Pretraining + Decoupled-GAE

VAPO adopts two techniques from VC-PPO (Value-Calibrated PPO):

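Value Pretraining: Before RL begins, the value model is trained on rollouts from a fixed initial policy. In VC-PPO this is the SFT policy $\pi_\text{SFT}$ ; in VAPO’s experiments, which use no SFT data, the frozen base model plays the same role. Concretely, rollouts are generated from that policy, rewards are assigned by the verifier, and the value network is trained to minimise: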

\mathcal{L}_V = \mathbb{E}_{(s_t, G_t)}\!\left[(V_\phi(s_t) - G_t)^2\right] \tag{5}

where $G_t = \sum_{l \geq t} \gamma^{l-t} R(s_l, a_l)$ is the Monte Carlo return from step $t$ . This gives the value model a warm-start on the distribution of states and rewards it will encounter during RL, substantially reducing cold-start bias.
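For a sparse terminal reward, the regression targets in Eq. 5 are simple to construct. The following sketch computes the Monte Carlo returns $G_t$ and the squared-error loss on plain lists; the function names are illustrative, and a real implementation would operate on batched tensors.

```python
def mc_returns(rewards, gamma=1.0):
    """Monte Carlo return G_t for every token of one trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def value_loss(predicted, targets):
    """Mean squared error between value predictions and returns (Eq. 5)."""
    return sum((v - g) ** 2 for v, g in zip(predicted, targets)) / len(targets)

# A correct 4-token response: reward 1 only at the last token.
targets = mc_returns([0.0, 0.0, 0.0, 1.0])
print(targets)                                   # every token's target is the final reward
print(value_loss([0.5, 0.5, 0.5, 0.5], targets))
```

With $\gamma = 1$ every token in a response shares the same target (the terminal reward), so the value model is effectively learning the probability that a given prefix ends in a correct answer.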

Decoupled-GAE: Rather than updating the value network and the policy in the same backward pass, VAPO decouples them. The value network is updated first with fresh rollouts, and only then are the advantages computed using the updated value estimates before the policy update. This ensures policy gradients are computed with the best-available advantage estimates rather than the stale ones from the previous iteration. The two updates also use different estimators: the value targets in Eq. 5 are full Monte Carlo returns (equivalent to $\lambda = 1$ ), while the policy advantages use a smaller, separately chosen $\lambda$ .

Algorithm 1: Decoupled-GAE Update (Pseudocode)
────────────────────────────────────────────
Input:  Policy πθ, Value Vφ, Rollout buffer B
        Update intervals: K_v (value), K_π (policy)

1. Collect rollouts: for each prompt x in batch:
   a. Sample G responses {y_1,...,y_G} from πθ_old
   b. Get rewards {r_1,...,r_G} from verifier
   c. Store (x, y_i, r_i) in buffer B

2. VALUE UPDATE (K_v gradient steps):
   for step in range(K_v):
     Compute targets G_t for each (s_t, r_terminal) in B
     Minimise L_V = E[(V_φ(s_t) - G_t)^2]  [Eq. 5]
     Update φ ← φ - α_v * ∇_φ L_V

3. ADVANTAGE ESTIMATION:
   for each (s_t, a_t, r_t) in B:
     δ_t = r_t + γ·V_φ(s_{t+1}) - V_φ(s_t)    [TD residual]
     Â_t = Σ_{l≥0} (γλ)^l · δ_{t+l}            [GAE, Eq. 3]
     (λ chosen by Length-adaptive rule, Eq. 6)

4. POLICY UPDATE (K_π gradient steps):
   for step in range(K_π):
     r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t)
     Maximise L_CLIP(θ) [Eq. 2, with Clip-Higher]
     Update θ ← θ + α_π * ∇_θ L_CLIP    [gradient ascent on the surrogate]

5. Optional SIL step: replay top-k buffer trajectories
────────────────────────────────────────────

The key insight: by separating steps 2 and 4, the policy update in step 4 uses advantages computed from a freshly-updated value model ( $V_\phi$ after step 2), not the stale one from the previous RL iteration. This directly reduces the advantage bias caused by a lagging value model.

Challenge 2: Heterogeneous Sequence Lengths

The Problem in Depth

Long-CoT reasoning produces responses of wildly varying lengths. A simple arithmetic problem might be solved in 200 tokens; a hard competition-math problem might require 8000+ tokens of exploration, backtracking, and verification. VAPO’s training data contains both kinds.

The problem is that a single value of $\lambda$ in GAE works poorly across this range. Here is why:

For a short response (say, 200 tokens), bootstrapping is not too risky — even if $V_\phi$ is slightly off, there are few steps for errors to accumulate. A low $\lambda$ (e.g., 0.3) favours low-variance one-step estimates and works fine.

For a long response (say, 8000 tokens), the situation is reversed. With low $\lambda$ , the advantage estimate at token 1 is almost entirely determined by $V_\phi(s_1)$ , which must predict a reward that won’t arrive for 8000 tokens. If $V_\phi$ is biased (and it will be at long horizons), every single token gets a wrong advantage. A high $\lambda$ (e.g., 0.95) is better because it keeps summing TD residuals over many steps, reducing reliance on the single $V_\phi$ estimate, at the cost of higher variance (many residuals contribute).

With a fixed $\lambda$ , you’re forced to pick a compromise that’s suboptimal for both extremes. VAPO solves this by making $\lambda$ a function of response length.

VAPO’s Fix: Length-Adaptive GAE

Length-adaptive GAE computes $\lambda$ as a sigmoid function of the response length $L$ (in tokens):

\lambda(L) = \lambda_{\min} + (\lambda_{\max} - \lambda_{\min}) \cdot \sigma\!\left(\frac{L - \mu_L}{\sigma_L}\right) \tag{6}

where:

$\lambda_{\min}$ , $\lambda_{\max}$ are the minimum and maximum $\lambda$ values (e.g., 0.3 and 0.95),
$\sigma(\cdot)$ is the sigmoid function,
$\mu_L$ and $\sigma_L$ are the mean and standard deviation of response lengths in the current training batch (used to normalise $L$ ).
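Eq. 6 as written above can be sketched in a few lines. The code below is a minimal illustration of that formula only; the `min_std` floor on $\sigma_L$ is an assumption added here to avoid division by zero, and the function name is hypothetical.

```python
import math
from statistics import mean, pstdev

def length_adaptive_lambdas(lengths, lam_min=0.3, lam_max=0.95, min_std=1.0):
    """One lambda per response, following Eq. 6 with batch-level normalisation."""
    mu = mean(lengths)
    sigma = max(pstdev(lengths), min_std)     # assumption: floor on sigma_L
    lambdas = []
    for length in lengths:
        z = (length - mu) / sigma             # normalised length
        s = 1.0 / (1.0 + math.exp(-z))        # sigmoid
        lambdas.append(lam_min + (lam_max - lam_min) * s)
    return lambdas

# Short, average, and long responses in one batch.
print(length_adaptive_lambdas([200, 4100, 8000]))
```

For this three-response batch the values come out at roughly 0.45, 0.625, and 0.80. Because lengths are z-scored within the batch, the extremes $\lambda_{\min}$ and $\lambda_{\max}$ are only approached when a response is several standard deviations from the batch mean.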

graph LR
    A["Response length L\n(tokens)"] --> B["Sigmoid normalisation\n(L - μ_L) / σ_L"]
    B --> C["λ(L) = λ_min + (λ_max - λ_min)·σ(...)"]
    C --> D["Short response\nL ≪ μ_L\n→ λ ≈ λ_min ≈ 0.3\n(low variance, higher bias)"]
    C --> E["Long response\nL ≫ μ_L\n→ λ ≈ λ_max ≈ 0.95\n(less reliance on V_φ)"]

Figure 2: Length-Adaptive GAE Mapping — The sigmoid ensures a smooth, differentiable transition between short-response (low- $\lambda$ ) and long-response (high- $\lambda$ ) regimes, with the batch statistics providing automatic normalisation.

Why Sigmoid Specifically?

The sigmoid is a natural choice because:

It maps any real input to (0, 1), so $\lambda(L)$ always lies strictly between $\lambda_\text{min}$ and $\lambda_\text{max}$ .
It is monotonically increasing, so longer responses always get higher $\lambda$ .
It is smooth, avoiding sharp transitions that could create discontinuities in the loss landscape.
The inflection point (halfway between $\lambda_\text{min}$ and $\lambda_\text{max}$ ) occurs at $L = \mu_L$ , meaning the midpoint is the average response length — a natural choice.

Alternative Considered: Fixed $\lambda$

The obvious alternative is a single $\lambda$ for all sequences, as done in DAPO/GRPO (which technically set $\lambda = 1$ implicitly by using full Monte Carlo returns). As the ablations show, switching from fixed to adaptive $\lambda$ accounts for a meaningful fraction of VAPO’s gains. The improvement is clearest on hard problems requiring very long chains — exactly where fixed $\lambda$ would be most severely mismatched.

Boundary Conditions

Length-adaptive GAE relies on two assumptions:

The value model is trained at all lengths. If $V_\phi$ is only trained on short responses, bootstrapping estimates for long sequences are unreliable regardless of $\lambda$ .
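Lengths in the training batch span a sufficient range. Because $L$ is normalised by the batch standard deviation, a batch of near-identical lengths makes $\sigma_L$ tiny: the normalisation then amplifies negligible length differences into large swings in $\lambda$ , and it is undefined when $\sigma_L = 0$ . In that regime the schedule offers no benefit over a fixed $\lambda$ and needs a floor on $\sigma_L$ to stay well behaved.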

Both conditions are met in practice by the VAPO training setup (diverse MATH/AIME prompts, no length filtering).

Challenge 3: Sparse Rewards

The Problem in Depth

In verifiable reasoning tasks, the reward signal is maximally sparse: a binary scalar (+1/0) assigned only at the final token. No intermediate feedback is provided. From the MDP perspective, every intermediate $R(s_t, a_t) = 0$ for $t < H$ .

This creates several difficulties:

Credit assignment over long horizons: If a response has 4000 tokens and gets reward 0, the model must somehow learn which tokens were responsible for the failure. With only the terminal reward and a potentially imperfect value model, this signal is extremely weak.
Exploration plateau: Once the policy learns a particular reasoning style that happens to get some rewards, it may stop exploring. The sparse reward gives no gradient to correct intermediate reasoning steps, only the final outcome.
Collapse risk: If the model enters a local minimum where it always generates short wrong answers (getting reward 0) or very long uncertain answers (also reward 0 most of the time), no gradient signal helps it escape.

VAPO’s Fix: Three Complementary Techniques

VAPO uses three techniques together to address sparse rewards:

Clip-Higher (from DAPO)

Standard PPO clips the probability ratio $r_t(\theta)$ symmetrically to $[1-\epsilon, 1+\epsilon]$ , so the policy gets the same headroom for raising a token’s probability as for lowering it.

Clip-Higher uses an asymmetric clip range:

\mathcal{L}^{\mathrm{CH}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta), 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}}\right)\hat{A}_t\right)\right] \tag{7}

where $\epsilon_{\text{high}} > \epsilon_{\text{low}}$ (e.g., $\epsilon_{\text{low}} = 0.2$ , $\epsilon_{\text{high}} = 0.28$ ).

Why does this help? For tokens with positive advantage (the model got it right), the larger upper clip $\epsilon_\text{high}$ allows more aggressive probability increase per step, encouraging faster learning from successes. For tokens with negative advantage, the smaller lower clip $\epsilon_\text{low}$ still limits unlearning, preserving stability. In a sparse-reward setting where positive-advantage tokens are rare, increasing the learning signal from the few correct trajectories is especially valuable.
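The effect of the asymmetric range in Eq. 7 shows up on a single token. This is a minimal sketch using the $\epsilon$ values quoted above; the function name `clip_higher_term` is illustrative.

```python
import math

def clip_higher_term(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token Clip-Higher surrogate term (Eq. 7)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage, ratio 1.25: inside the widened upper bound, so not clipped.
print(clip_higher_term(math.log(1.25), 0.0, 1.0))
# Positive advantage, ratio 1.5: capped at 1 + eps_high.
print(clip_higher_term(math.log(1.5), 0.0, 1.0))
# Negative advantage, ratio 0.5: capped at 1 - eps_low, as in standard PPO.
print(clip_higher_term(math.log(0.5), 0.0, -1.0))
```

A ratio of 1.25 would already be clipped to 1.2 under the symmetric $\epsilon = 0.2$ range and stop producing gradient; under Clip-Higher it still does.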

graph LR
    A["Token advantage > 0\n(positive signal)"] --> B["Clip-Higher allows\nε_high = 0.28\nFaster probability increase"]
    C["Token advantage < 0\n(negative signal)"] --> D["Standard clip\nε_low = 0.2\nStable probability decrease"]
    B --> E["Net effect: faster\nlearning from successes\nin sparse-reward setting"]
    D --> E

Figure 3: Clip-Higher Asymmetric Clipping — By widening the upper clip bound, VAPO extracts more gradient signal from the relatively rare events where the model produces correct long-CoT solutions.

Token-Level Loss (from DAPO)

Standard PPO computes the policy gradient loss averaged over sequences (each sequence contributes equally regardless of length). With long-CoT, this means a 200-token correct response and an 8000-token correct response contribute the same total gradient magnitude — drastically underweighting the long-response signal per token.

Token-level loss normalises by the total number of tokens in the batch rather than the number of sequences:

\mathcal{L}^{\mathrm{TL}} = \frac{1}{\sum_i L_i} \sum_i \sum_{t=1}^{L_i} \mathcal{L}^{\mathrm{CH}}_{i,t} \tag{8}

This gives each token an equal contribution to the gradient, ensuring long responses are not diluted by sequence-count normalisation.
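The difference between the two normalisations is easy to check on a toy batch. The sketch below assumes per-token loss values are already computed and stored as one list per sequence; both function names are illustrative.

```python
def sequence_mean_loss(per_token_losses):
    """Average within each sequence first, then across sequences."""
    seq_means = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(seq_means) / len(seq_means)

def token_mean_loss(per_token_losses):
    """Average over all tokens in the batch (Eq. 8)."""
    total = sum(sum(seq) for seq in per_token_losses)
    count = sum(len(seq) for seq in per_token_losses)
    return total / count

# One 2-token sequence with loss 1.0 per token, one 8-token sequence with loss 0.0.
batch = [[1.0, 1.0], [0.0] * 8]
print(sequence_mean_loss(batch))  # the short sequence carries half the weight
print(token_mean_loss(batch))     # each of the 10 tokens carries equal weight
```

Under sequence-level averaging each token of the short response weighs four times as much as a token of the long one; under token-level averaging all ten tokens weigh the same.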

Self-Imitation Learning (SIL)

SIL maintains a replay buffer $\mathcal{B}$ containing up to $k$ of the highest-reward trajectories seen during training (with binary rewards, these are simply correct responses). Periodically (every $N$ gradient steps), VAPO samples from $\mathcal{B}$ and adds an imitation loss:

\mathcal{L}^{\mathrm{SIL}} = -\mathbb{E}_{(x,y^*) \sim \mathcal{B}}\!\left[\sum_t \log \pi_\theta(y^*_t | s_t)\right] \tag{9}

This is simply supervised cross-entropy loss on the best-ever trajectories. It functions as a floor: even when the current policy is struggling and generating mostly-wrong responses (providing weak online RL signal), the SIL term keeps the gradient pushing towards previously-successful reasoning patterns.

Why not just use SFT? Because $\mathcal{B}$ is populated by the policy’s own successful explorations — patterns that emerged from RL, not from human-authored demonstrations. SIL thus injects genuine self-discovered knowledge, not externally provided answers.
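A replay buffer and the loss in Eq. 9 can be sketched as follows. The class name `SILBuffer`, the first-in-first-out eviction rule, and averaging over sampled responses are all assumptions made for illustration; the source does not specify how the buffer is managed.

```python
import random

class SILBuffer:
    """Hypothetical fixed-capacity buffer of correct (reward 1) trajectories."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def add(self, prompt, response_tokens, reward):
        if reward < 1.0:
            return                      # only successful trajectories are stored
        self.items.append((prompt, response_tokens))
        if len(self.items) > self.capacity:
            self.items.pop(0)           # assumption: evict the oldest entry

    def sample(self, k):
        return random.sample(self.items, min(k, len(self.items)))

def sil_loss(token_logprobs_per_response):
    """Eq. 9: negative summed token log-likelihood, averaged over responses."""
    total = sum(sum(lps) for lps in token_logprobs_per_response)
    return -total / len(token_logprobs_per_response)

buf = SILBuffer(capacity=2)
buf.add("p1", [1, 2], 1.0)
buf.add("p2", [3], 0.0)     # wrong answer: ignored
buf.add("p3", [4], 1.0)
buf.add("p4", [5], 1.0)     # evicts p1
print([prompt for prompt, _ in buf.items])
print(sil_loss([[-0.1, -0.2], [-0.3]]))
```

The log-probabilities passed to `sil_loss` would come from the current policy evaluated on the stored responses, which is what makes this an off-policy term.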

VAPO’s Full Algorithm

Putting all components together:

Algorithm 2: VAPO Full Training Loop
────────────────────────────────────────────────────────────
Initialise:
  π_θ  ← Qwen2.5-32B (no SFT data used in experiments)
  V_φ  ← copy of π_θ with added value head
  B    ← empty SIL replay buffer (capacity K trajectories)
  Dataset D ← MATH competition problems

Phase 0: VALUE PRETRAINING
  For T_pre steps:
    Sample batch of prompts X ⊆ D
    Generate responses {y_i} from π_θ (frozen)
    Assign rewards {r_i} with verifier
    Compute Monte Carlo returns G_t for each token
    Update V_φ to minimise L_V [Eq. 5]

Phase 1: VAPO RL LOOP
  For T_rl steps:

    ── ROLLOUT COLLECTION ─────────────────────────────
    Sample batch of prompts X ⊆ D
    For each x ∈ X:
      Sample G responses {y_1,...,y_G} from current π_θ
      (Group-Sampling, like GRPO)
      Assign verifier rewards {r_1,...,r_G} ∈ {0,1}
      Store top-1 trajectory in B if r_i=1
    
    ── VALUE UPDATE (Decoupled-GAE step 1) ────────────
    For K_v steps:
      Compute return targets G_t from rollout buffer
      Update V_φ: minimise L_V [Eq. 5]
    
    ── ADVANTAGE ESTIMATION (Decoupled-GAE step 2) ────
    For each (s_t, a_t, r_t) in rollout buffer:
      Compute L = sequence length of this trajectory
      λ_L = λ_min + (λ_max - λ_min)·σ((L - μ_L)/σ_L)  [Eq. 6, one λ per trajectory]
      δ_t = r_t + γ·V_φ(s_{t+1}) - V_φ(s_t)
      Â_t = Σ_{l≥0} (γ·λ_L)^l · δ_{t+l}               [Eq. 3, adaptive λ]
    
    ── POLICY UPDATE ──────────────────────────────────
    For K_π steps:
      Compute token-level clip-higher objective L^CH [Eq. 7]
      Normalise by total tokens (token-level loss) [Eq. 8]
      Update θ ← θ + α_π ∇_θ L^TL   (gradient ascent on the surrogate)
    
    ── SELF-IMITATION STEP (every N_sil steps) ────────
    Sample batch from B
    Compute imitation loss L^SIL [Eq. 9]
    Update θ ← θ - α_sil ∇_θ L^SIL
  
  Return updated π_θ
────────────────────────────────────────────────────────────

The Group-Sampling Connection

One detail worth flagging: VAPO retains Group-Sampling from GRPO — for each prompt, $G$ responses are sampled rather than one. This is not just for reward baseline estimation (as in GRPO); in VAPO’s context, Group-Sampling serves two purposes:

It provides diverse trajectories for the value model to train on, reducing overfitting to a single response style.
It increases the probability of at least one correct trajectory per prompt (useful for populating the SIL buffer), especially early in training when the policy is weak.

The $G$ responses are used for rollouts and SIL buffer population; advantage estimation then uses the value model (not the group-mean reward as in GRPO).
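The second purpose can be quantified with a one-line calculation. Assuming the $G$ samples are independent and each is correct with probability $p$ (a simplification, and the value of $p$ below is purely illustrative), the chance of at least one correct response is $1 - (1-p)^G$ :

```python
def p_at_least_one_correct(p, G):
    """Probability that at least one of G independent samples is correct."""
    return 1.0 - (1.0 - p) ** G

# A weak early policy that solves a prompt 5% of the time, with G = 8 samples.
print(round(p_at_least_one_correct(0.05, 8), 3))
```

Under these assumptions, a prompt solved only 5% of the time per sample yields at least one correct trajectory in about a third of groups of 8.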

Experiments and Results

Setup

Base model: Qwen2.5-32B (no SFT cold-start — pure RL from the base model, matching DAPO’s experimental protocol for fair comparison).
Training data: MATH-level competition problems. No SFT data introduced.
Evaluation: AIME 2024 (30 problems, average@16 pass rate), AMC 2023, MATH-500.
Key hyperparameters: $G=8$ group responses per prompt, $\epsilon_\text{low}=0.2$ , $\epsilon_\text{high}=0.28$ , $\lambda_\text{min}=0.3$ , $\lambda_\text{max}=0.95$ , $K_v = 4$ value update steps per RL step.

Main Results

Method	Framework	AIME 2024	Training Steps
Vanilla PPO	Value-model-based	~5	—
GRPO	Value-model-free	~40	—
DeepSeek-R1-Zero (Qwen32B)	Value-model-free	~50	~10,000+
DAPO	Value-model-free	~50	~10,000
VAPO (ours)	Value-model-based	60.4	5,000

The headline result: VAPO achieves 60.4 on AIME 2024 (a difficult set of 30 competition-math problems) versus ~50 for both DAPO and DeepSeek-R1-Zero (Qwen32B), under the same experimental protocol. VAPO reaches this peak within 5,000 gradient steps, roughly half the number of steps DAPO needs to reach its score.

Equally important is stability: across multiple independent runs with different random seeds, VAPO’s training curves show no crashes or divergences. Vanilla PPO, by contrast, frequently diverges on long-CoT problems due to the cold-start value model issue — explaining its catastrophically low AIME score of ~5.

xychart-beta
    title "AIME 2024 Score vs. Training Steps (Approximate)"
    x-axis ["0", "1000", "2000", "3000", "4000", "5000"]
    y-axis "AIME 2024 Score" 0 --> 65
    line [5, 20, 35, 45, 55, 60.4]
    line [5, 15, 25, 35, 42, 50]

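Figure 4: VAPO vs. DAPO Training Curve (Schematic). The plotted points are illustrative, not data read from the paper. VAPO (upper curve) reaches its peak score of 60.4 at ~5,000 steps. The DAPO curve (lower) is compressed onto the same 0 to 5,000-step axis for shape comparison only; per the results table, DAPO needs ~10,000 steps to reach ~50. The x-axis represents gradient update steps; both use Qwen2.5-32B.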

Ablation Study

The paper systematically removes each component from VAPO to measure individual contributions:

Configuration	AIME 2024	Delta vs. Full
Full VAPO	60.4	—
- Length-adaptive GAE (use fixed λ)	~54	-6.4
- Decoupled-GAE (joint update)	~56	-4.4
- Value Pretraining	~53	-7.4
- Clip-Higher (symmetric clip)	~57	-3.4
- SIL	~58	-2.4
- Token-level Loss	~56	-4.4
Remove all (= vanilla PPO)	~5	-55.4

(Note: exact ablation numbers are schematic reconstructions based on the relative contributions described in the paper — the paper’s Figure/Table format uses relative deltas, not always absolute scores for each ablation.)

Key takeaways:

Value Pretraining has the largest single impact. Without it, value estimates are so poor early in training that PPO degenerates.
Length-adaptive GAE is the second largest contributor — validating the heterogeneous-length hypothesis.
All components are necessary; removing any one causes a meaningful drop.

Architecture and Implementation Details

Value Model Architecture

The value model shares the transformer backbone with the policy (Qwen2.5-32B) but has an additional scalar head: a linear layer that projects the hidden state at each token position $t$ (the last token of the prefix $s_t$ ) to a single scalar $V_\phi(s_t)$ . During Phase 0 (value pretraining), the backbone is frozen and only the value head is trained. During RL, the backbone is jointly fine-tuned for both policy and value objectives (with the decoupled update schedule from Algorithm 2).

graph LR
    A["Input tokens\ns_t = (x, y_0...y_t)"] --> B["Transformer\n(Qwen2.5-32B shared backbone)"]
    B --> C["Policy head\noutput: P(a | s_t)\n(vocabulary distribution)"]
    B --> D["Value head\noutput: V_φ(s_t)\n(scalar)"]
    C --> E["Token sampling\n/ policy gradient"]
    D --> F["GAE advantage\nestimation"]

Figure 5: Shared Backbone Architecture — The same Qwen2.5-32B transformer provides hidden states for both the policy head (producing token distributions) and the value head (producing scalar value estimates). Decoupled updates prevent the value objective from distorting the policy gradient and vice versa.

Training Infrastructure

ByteDance Seed uses its internal distributed training infrastructure. Training is conducted with:

8 rollout workers generating responses in parallel (matching the $G=8$ group-sampling factor).
Separate value update and policy update phases on TPUs/GPUs (the decoupled architecture allows pipelining these two phases).
Response length capped at a maximum (specific cap not stated in the paper, but implied to be 8K–16K tokens given AIME problem complexity).

Comparison to Prior Work

graph TD
    A["PPO (Schulman 2017)\nValue-model-based\nCLIP + GAE"] --> B["VC-PPO\nValue Calibration:\nPretraining + Decoupled-GAE"]
    A --> C["GRPO (DeepSeekMath)\nValue-model-free\nGroup mean baseline"]
    C --> D["DAPO\nClip-Higher + Token-level loss"]
    B --> E["VAPO\nAll of the above +\nLength-adaptive GAE\n+ SIL + Group-Sampling"]
    D --> E
    C --> E

Figure 6: Lineage of VAPO — VAPO synthesises contributions from PPO (baseline algorithm), VC-PPO (value calibration), GRPO (group sampling), and DAPO (clip-higher, token-level loss), adding the novel Length-adaptive GAE as its primary new technique.

vs. PPO: VAPO fixes the three core failure modes of PPO on long-CoT. Vanilla PPO scores ~5 on AIME 2024; VAPO scores 60.4. Most of the improvement is attributable to value-model engineering, although the ablations also credit the Clip-Higher and token-level changes to the policy loss with several points each.

vs. GRPO: GRPO avoids value model issues by eliminating the value model. VAPO instead argues: fix the value model correctly, and you can do better. VAPO beats GRPO on AIME 2024 by ~20 points (60.4 vs. ~40).

vs. DAPO: DAPO is the strongest value-model-free baseline. VAPO borrows DAPO’s clip-higher and token-level loss innovations, then adds value model improvements and length-adaptive GAE on top. The net gain is +10 points with fewer training steps.

vs. DeepSeek-R1-Zero: DeepSeek-R1-Zero (Qwen32B reimplementation) uses GRPO and achieves ~50 on AIME 2024. VAPO surpasses this by 10 points without SFT data, suggesting the value model is genuinely providing better credit assignment.

Limitations and Boundary Conditions

The paper is candid about several constraints:

Scale: All experiments use Qwen2.5-32B. It is unclear how well Length-adaptive GAE generalises to 7B-scale models (where episodes are shorter and value learning may behave differently) or to 70B+ models (where training cost explodes).
Task domain: VAPO is evaluated exclusively on mathematical competition problems (AIME, AMC, MATH-500). These have clean verifiable binary rewards. Extension to code generation, scientific reasoning, or multi-step tool-use tasks — where rewards are noisy or multi-dimensional — is not demonstrated.
No SFT cold-start: The experiments explicitly exclude SFT data to match DAPO’s protocol. In practice, most production pipelines use SFT warm-starting. Whether VAPO’s advantages hold when initialising from a SFT checkpoint (where value estimates are already somewhat calibrated) is not tested.
Value model overhead: The decoupled update adds a full additional value update pass ( $K_v = 4$ steps) per RL step. This increases training time and GPU memory requirements. The paper does not provide a detailed wall-clock comparison to DAPO.
SIL buffer management: How trajectories are selected and aged out of the buffer $\mathcal{B}$ is not fully specified. In non-stationary RL, old trajectories from earlier policies may be misleading as the policy improves.

Critical Assessment: Weaknesses & Improvements

Weaknesses & Flaws

1. Incomplete baselines. VAPO compares against DAPO and DeepSeek-R1-Zero but does not compare against VC-PPO (the immediate predecessor that introduced Value Pretraining and Decoupled-GAE). Since VAPO explicitly builds on VC-PPO, showing the isolated gains from adding Length-adaptive GAE and SIL on top of VC-PPO would sharpen the narrative. The ablation Table partially addresses this, but a head-to-head VC-PPO baseline in the main results table is missing. Readers cannot determine how much of the 60.4 score comes from VC-PPO’s innovations versus VAPO’s new additions.

2. Single model size / single base model. All results use Qwen2.5-32B. No results with Llama, Mistral, DeepSeek, or different parameter counts. This is a narrow claim: VAPO may benefit from Qwen2.5-32B’s specific pretraining distribution or architecture. The claimed superiority of value-model-based methods over value-model-free methods might not hold for models with different value-learning dynamics.

3. Evaluation metric concentration. AIME 2024 has only 30 problems, making it an extremely noisy benchmark. A difference of 10 points (60.4 vs. 50) corresponds to only 3 additional problems solved correctly (out of 30), which could be within statistical noise across different random seeds or problem sets. The paper reports average@16 pass rates rather than individual problem accuracy, which helps but does not fully resolve the variance concern. Confidence intervals are not reported.

4. Sparse computational transparency. The paper describes the algorithm in detail but does not provide wall-clock training times, GPU-hour budgets, or a comparison of training efficiency (VAPO requires an extra value model forward/backward pass per step vs. DAPO’s model-free approach). Claiming “efficiency” based on step count alone is misleading if each VAPO step is 2× more expensive than a DAPO step.

5. SIL ablation is incomplete. The SIL component is introduced but the ablation only measures its final performance impact (−2.4 points). The paper does not show: (a) how SIL affects training stability, (b) what happens when the replay buffer fills up with trajectories from early training that may be suboptimal from the current policy’s perspective (stale data problem), or (c) whether SIL’s benefit scales with buffer size.

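6. Length-adaptive GAE sensitivity. Equation 6 depends on four quantities: two hyperparameters ( $\lambda_\text{min}$ , $\lambda_\text{max}$ ) and two statistics computed from the batch ( $\mu_L$ , $\sigma_L$ ). The chosen values ( $\lambda_\text{min}=0.3$ , $\lambda_\text{max}=0.95$ ) are presented without a hyperparameter sweep, so it is unclear how robust the method is to these choices. Because $\mu_L$ and $\sigma_L$ are recomputed per batch, the $\lambda$ assigned to a given response also depends on which other responses share its batch, and that sensitivity is not studied either.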
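To put the noise concern from item 3 in numbers, here is a rough sketch that treats each of the 30 AIME problems as a single Bernoulli trial and computes a 95% Wilson score interval. The counts 18/30 and 15/30 are illustrative stand-ins for "about 60%" and "about 50%", not figures from the paper, and the calculation only captures problem-sampling noise (averaging over 16 samples per problem reduces sampling noise within a problem, not this).

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Illustrative counts only: roughly 60% and 50% of 30 problems.
lo_a, hi_a = wilson_interval(18, 30)
lo_b, hi_b = wilson_interval(15, 30)
print(f"18/30: [{lo_a:.3f}, {hi_a:.3f}]")
print(f"15/30: [{lo_b:.3f}, {hi_b:.3f}]")
print("intervals overlap:", lo_a < hi_b)
```

Under these simplifying assumptions the two intervals overlap substantially, which is why confidence intervals or a larger evaluation set would strengthen the comparison.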

Limitations the Authors Understate or Omit

1. Value model collapse at very long horizons. The paper argues that Length-adaptive GAE with high $\lambda$ for long sequences reduces value model bias. This is correct, but high $\lambda$ also means the TD target for early tokens sums many future TD residuals — if the value model is slightly wrong at many intermediate steps, the accumulated error can be large despite the intent to “bootstrap less.” This is the classic bootstrapping-vs-variance tradeoff, and the paper does not demonstrate empirically that the value model is actually less biased for long sequences with adaptive $\lambda$ .

2. On-policy data requirement. Group-Sampling at $G=8$ means VAPO generates 8× more rollouts per prompt than a single-sample method. For a 32B model, this is expensive. The paper does not discuss how the compute cost scales with $G$ or whether $G=8$ is necessary — GRPO’s group averaging benefit diminishes as $G$ increases, suggesting diminishing returns.

3. Binary reward assumption. The verifier provides binary {0, 1} rewards. VAPO’s SIL and group-sampling are designed for this setting. Whether these mechanisms transfer to dense-reward or multi-dimensional reward settings (e.g., code quality + correctness + efficiency) is not analysed.

Concrete Improvement Suggestions

1. Add VC-PPO as an explicit baseline. Directly compare VAPO vs. VC-PPO on AIME 2024 to isolate the contribution of Length-adaptive GAE and SIL. This would make the paper’s contribution claims much sharper and verifiable.

2. Report wall-clock time and GPU hours. The “5,000 steps” claim is presented as efficiency evidence, but without a wall-clock comparison it does not establish real-world efficiency. Add a table with columns: method, steps to reach 50 on AIME 2024, wall-clock hours to reach 50, and total GPU-hours.

3. Cross-model validation. Run VAPO on Llama-3-70B or DeepSeek-V2-Lite to test generality beyond Qwen2.5-32B. If the value-model-based advantage is architectural (e.g., Qwen’s value head architecture is more amenable), the claim needs qualification.

4. Larger evaluation set. Supplement AIME 2024 (30 problems) with a larger held-out competition problem set (e.g., 200+ problems) to get statistically meaningful comparisons. Report 95% confidence intervals or bootstrap intervals on the 30-problem AIME results.

5. SIL buffer dynamics. Add an experiment studying SIL buffer freshness: compare (a) fixed buffer from value pretraining phase, (b) FIFO buffer (newest trajectories only), (c) priority-weighted buffer (highest-reward trajectories regardless of age). This would clarify the mechanism and provide practitioners with a principled buffer management strategy.

6. Extend to code generation. Evaluate on HumanEval or SWE-Bench to demonstrate domain generality and show whether Length-adaptive GAE benefits transfer to code tasks, where solution length also varies widely.

Reproducibility Notes

VAPO is a ByteDance Seed internal system. No open-source code or checkpoints are released. The paper provides sufficient algorithmic detail to reimplement the core ideas:

VAPO’s key addition (Length-adaptive GAE) can be implemented in ~20 lines on top of any PPO codebase (TRL, OpenRLHF, veRL).
The value pretraining and decoupled update schedule are standard techniques (from VC-PPO) and are well-described.
Hyperparameters are reported: $G=8$ , $\epsilon_\text{low}=0.2$ , $\epsilon_\text{high}=0.28$ , $\lambda_\text{min}=0.3$ , $\lambda_\text{max}=0.95$ , $K_v=4$ .
The training dataset is described as “MATH competition problems” without a precise specification of which subset or how many problems.
The SIL buffer capacity $K$ and update frequency $N_\text{sil}$ are not precisely specified.

Community reimplementations will likely appear in OpenRLHF or veRL given the paper’s clear algorithm description and the strong benchmark results.

Deep Dive: The Mathematics of Length-Adaptive GAE

This section unpacks the mathematics behind Length-adaptive GAE more rigorously, tracing through why the formula works and where its assumptions could break.

Deriving the Bias-Variance of Standard GAE

Starting from the GAE formula (Eq. 3):

\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l \delta_{t+l} \tag{3}

where $\delta_l = r_l + \gamma V(s_{l+1}) - V(s_l)$ is the TD residual. Let us write the full expansion to see what $\hat{A}_t$ actually computes:

\hat{A}_t = \delta_t + \gamma\lambda \, \delta_{t+1} + (\gamma\lambda)^2 \delta_{t+2} + \cdots \tag{10}

Substituting $\delta_l = r_l + \gamma V(s_{l+1}) - V(s_l)$ :

\hat{A}_t = \bigl[r_t + \gamma V(s_{t+1}) - V(s_t)\bigr] + \gamma\lambda\bigl[r_{t+1} + \gamma V(s_{t+2}) - V(s_{t+1})\bigr] + \cdots \tag{11}

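After telescoping (each interior $V(s_{t+l})$ appears once with weight $\gamma(\gamma\lambda)^{l-1}$ from $\delta_{t+l-1}$ and once with weight $-(\gamma\lambda)^{l}$ from $\delta_{t+l}$ , so the $V$ terms partially cancel), this simplifies to:

\hat{A}_t = -V(s_t) + \sum_{l=0}^{T-t-1} (\gamma\lambda)^l r_{t+l} + \gamma(1-\lambda) \sum_{l=1}^{T-t-1} (\gamma\lambda)^{l-1} V(s_{t+l}) + \gamma(\gamma\lambda)^{T-t-1} V(s_T) \tag{12}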

Two special cases make the tradeoff transparent:

Case $\lambda = 0$ : Equation 12 collapses to $\hat{A}_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ , the single-step TD advantage. This is the one-step bootstrap: low variance (only one reward sampled), but all error comes from $V$ ‘s inaccuracy. If $V$ is biased, $\hat{A}_t$ is biased.

Case $\lambda = 1$ : The sum over $l$ telescopes to the full discounted Monte Carlo return $G_t = \sum_{l=0}^{T-t-1} \gamma^l r_{t+l}$ , and the residual $V(s_T)$ vanishes (for episodic tasks where $V(s_T)=0$ ). This gives $\hat{A}_t = G_t - V(s_t)$ — unbiased but high variance because the Monte Carlo sum includes all stochastic future rewards.
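The two special cases can be checked numerically. The sketch below implements Eq. 3 with the usual backward recursion and compares it against the one-step TD advantage and the Monte Carlo advantage on a toy trajectory. The function name `gae` and the toy rewards and values are made up for illustration.

```python
def gae(rewards, values, gamma, lam):
    """GAE (Eq. 3) via backward recursion; values has len(rewards) + 1 entries."""
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# Toy trajectory: sparse terminal reward, arbitrary value estimates, V(s_T) = 0.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.4, 0.5, 0.3, 0.8, 0.0]
gamma = 0.9
T = len(rewards)

adv_td = gae(rewards, values, gamma, lam=0.0)
adv_mc = gae(rewards, values, gamma, lam=1.0)

# lambda = 0: one-step TD advantage.
td_expected = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
# lambda = 1: discounted Monte Carlo return minus the baseline V(s_t).
mc_expected = [
    sum(gamma ** l * rewards[t + l] for l in range(T - t)) - values[t]
    for t in range(T)
]

print(all(abs(a - b) < 1e-12 for a, b in zip(adv_td, td_expected)))
print(all(abs(a - b) < 1e-12 for a, b in zip(adv_mc, mc_expected)))
```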

A tempting first reading is that value-model error $\varepsilon_V$ enters $\hat{A}_t$ mainly through the terminal bootstrap term, whose weight is roughly $(\gamma\lambda)^{T-t}$ :

\mathrm{Bias}[\hat{A}_t] \approx (\gamma\lambda)^{T-t} \cdot \varepsilon_V \tag{13}

When $T$ is large (long sequence) and $\lambda < 1$ , this factor decays exponentially with the remaining horizon $T-t$ , and it decays more slowly the closer $\lambda$ is to 1. Note the direction: raising $\lambda$ for long sequences makes $(\gamma\lambda)^{T-t}$ larger, not smaller, so Eq. 13 on its own cannot be the reason Length-adaptive GAE helps. The numerical example below shows how little this term matters at realistic lengths.

Numerical Example of Adaptive vs. Fixed $\lambda$

Suppose $\gamma = 1$ (no discounting, as is common in RLHF settings), $\varepsilon_V = 0.2$ (value model has 20% systematic bias):

Sequence Length $T$	Fixed $\lambda = 0.7$	Adaptive $\lambda$	Bias ( $\lambda=0.7$ )	Bias (adaptive)
$T = 100$	0.7	$\approx 0.35$	$0.7^{100} \approx 0$	$\approx 0$
$T = 500$	0.7	$\approx 0.70$	$\approx 0$	$\approx 0$
$T = 2000$	0.7	$\approx 0.90$	$0.7^{2000} \approx 0$	$\approx 0$

Wait — with $\gamma\lambda < 1$ , the bias term $(\gamma\lambda)^{T-t}$ actually decays to 0 for all $\lambda < 1/\gamma$ . The benefit of high $\lambda$ for long sequences is therefore not about reducing Eq. 13’s bias term; it is about a different phenomenon: variance accumulation from bootstrapping at each step.

More precisely, for each position $t$ , the variance of $\hat{A}_t$ under GAE is:

\mathrm{Var}[\hat{A}_t] \approx \sum_{l=0}^{T-t-1} (\gamma\lambda)^{2l} \mathrm{Var}[\delta_{t+l}] \tag{14}

When $\lambda$ is small, the geometric decay $(\gamma\lambda)^{2l}$ means that only the first few TD residuals contribute significantly. For early tokens those residuals contain no reward at all: the only non-zero reward is the terminal one, and it enters $\hat{A}_t$ with weight $(\gamma\lambda)^{T-1-t}$ , which is effectively zero. The estimate therefore rests almost entirely on $V_\phi$ at the next few states. For long sequences, $V_\phi$ at intermediate states $s_{l+1}$ is trained on few examples and thus has high estimation variance. A small $\lambda$ is the worst of both worlds here: the advantage inherits both the bias and the noise of the value model, while the verified outcome barely reaches early tokens.

With high $\lambda$ , the advantage estimate is dominated by the accumulated reward sum (which is low-variance for deterministic verifiers) rather than many noisy $V$ evaluations. This is the real mechanism by which Length-adaptive GAE helps long sequences.
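A small calculation makes this mechanism concrete. Reading Eq. 3 directly, the terminal reward sits in the last TD residual and reaches token $t$ with weight $(\gamma\lambda)^{T-1-t}$ . The sketch below counts how many tokens of a 2000-token response receive at least 1% of the terminal reward for several values of $\lambda$ . The helper names and the 1% threshold are arbitrary choices for illustration.

```python
def terminal_reward_weight(T, t, lam, gamma=1.0):
    """Weight of the terminal reward in the GAE sum for token t (from Eq. 3)."""
    return (gamma * lam) ** (T - 1 - t)

def tokens_reached(T, lam, threshold=0.01, gamma=1.0):
    """Count tokens whose advantage puts at least `threshold` weight on the terminal reward."""
    return sum(
        1 for t in range(T)
        if terminal_reward_weight(T, t, lam, gamma) >= threshold
    )

T = 2000
for lam in (0.3, 0.7, 0.95, 1.0):
    print(f"lambda={lam}: {tokens_reached(T, lam)} of {T} tokens")
```

With $\gamma = 1$ the counts are 4, 13, 90 and 2000 tokens respectively: for any $\lambda$ noticeably below 1, only the tail of a long response sees the verifier's signal directly, and everything earlier depends on the value model.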

The Value Model Training Objective: A Closer Look

Monte Carlo Return Targets

VAPO uses Monte Carlo returns as supervision targets for the value model:

G_t = \sum_{l=t}^{T-1} \gamma^{l-t} r_l \tag{15}

For binary terminal rewards (the last reward $r_{T-1} \in \{0,1\}$ , all other $r_l = 0$ ), this simplifies to:

G_t = \gamma^{T-1-t} \cdot r_{T-1} \quad \forall t < T \tag{16}

With $\gamma \approx 1$ , this means $G_t \approx r_{T-1}$ for all $t$ : the regression target at any intermediate state is simply whether the episode eventually succeeded. The value model must learn to predict the probability of success at every intermediate token position.

This is an extremely hard learning problem: for a 4000-token reasoning chain, the model must predict from token 1 whether the final answer (at token 4000) will be correct. Within one trajectory every position receives the same target (all zeros or all ones when $\gamma = 1$ ), so a single trajectory says nothing about which intermediate states are genuinely more or less valuable. That distinction has to come from contrasting successful and failed trajectories, which is slow early in training.

Value Pretraining partially addresses this by giving the value network exposure to many trajectory outcomes before RL begins, but the fundamental difficulty of credit assignment over 4000 tokens remains. This is a real limitation that VAPO’s SIL component can partially compensate for (by providing more positive examples) but cannot fully resolve.
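As a minimal sketch of the Monte Carlo targets in Eq. 15, the helper below computes the return-to-go for every position of a response. The function name and the six-token toy trajectories are illustrative assumptions.

```python
def mc_return_targets(rewards, gamma=1.0):
    """Discounted return-to-go G_t for every position t (Eq. 15)."""
    targets = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets

success = [0.0] * 5 + [1.0]   # correct final answer
failure = [0.0] * 6           # wrong final answer

print(mc_return_targets(success))             # gamma = 1: same target at every token
print(mc_return_targets(failure))
print(mc_return_targets(success, gamma=0.5))  # discounting shrinks early targets
```

With $\gamma = 1$ every token of a successful response is regressed towards 1 and every token of a failed response towards 0, which is exactly why per-token credit assignment is hard to learn from these targets alone.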

Value Function Bootstrapping in Decoupled-GAE

In the decoupled update, the value model is updated on a mixture of:

Fresh on-policy rollouts from the current $\pi_\theta$ .
(Optionally) off-policy trajectories from the SIL buffer.

The value update minimises:

\mathcal{L}_V^{\mathrm{decoupled}} = \mathbb{E}_{(s_t, G_t) \sim \text{rollouts}}\!\left[(V_\phi(s_t) - G_t)^2\right] + \alpha \cdot \mathbb{E}_{(s_t, G_t) \sim \mathcal{B}}\!\left[(V_\phi(s_t) - G_t)^2\right] \tag{17}

The SIL buffer term (weighted by $\alpha$ ) keeps the value model calibrated on known-successful trajectories, preventing it from “forgetting” the value of high-reward states as the policy evolves.
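Eq. 17 can be sketched in a few lines. This is a plain-Python illustration of the two-term loss, not the authors' implementation: the function names are hypothetical, and `alpha=0.5` is a placeholder since the review does not report a value for $\alpha$ .

```python
def mse(preds, targets):
    """Mean squared error between value predictions and return targets."""
    return sum((p - g) ** 2 for p, g in zip(preds, targets)) / len(preds)

def decoupled_value_loss(rollout_preds, rollout_targets,
                         buffer_preds=(), buffer_targets=(), alpha=0.5):
    """On-policy MSE plus an alpha-weighted MSE on SIL-buffer samples (Eq. 17).

    alpha is a placeholder weight; the buffer term is skipped when no
    buffer samples are supplied.
    """
    loss = mse(rollout_preds, rollout_targets)
    if len(buffer_preds) > 0:
        loss += alpha * mse(buffer_preds, buffer_targets)
    return loss

# Toy numbers: two on-policy tokens, one buffered token from a past success.
loss = decoupled_value_loss([0.5, 0.5], [1.0, 0.0], [0.8], [1.0], alpha=0.5)
print(loss)
```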

Practical Implementation Guide

For practitioners who want to implement VAPO on top of an existing codebase (e.g., TRL, OpenRLHF, veRL), here is a step-by-step recipe:

Step 1: Extend Your PPO Implementation with a Value Head

If you are starting from a standard causal LM, add a small head that maps each token's hidden state to a scalar value. GAE needs $V(s_t)$ at every position, so the head scores all tokens rather than only the last one:

import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Maps per-token hidden states [B, T, H] to per-token values [B, T]."""
    def __init__(self, hidden_size, dropout=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.summary = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states):
        # Score every position, not just the last token
        output = self.dropout(hidden_states)
        output = self.dense(output)
        output = torch.tanh(output)
        output = self.dropout(output)
        return self.summary(output).squeeze(-1)  # [B, T]

Step 2: Implement Length-Adaptive GAE

The key functions: given a batch of rewards, value estimates and response lengths, compute an adaptive $\lambda$ per sequence and run GAE with it:

import torch

def compute_adaptive_lambda(lengths, lambda_min=0.3, lambda_max=0.95):
    """
    lengths: tensor of shape [B], each entry = number of tokens in response
    Returns: tensor of shape [B], adaptive lambda per sequence
    """
    lengths_float = lengths.float()
    mu = lengths_float.mean()
    # Population std avoids NaN when B == 1; the clamp avoids dividing by ~0
    sigma = lengths_float.std(unbiased=False).clamp(min=1.0)
    normalised = (lengths_float - mu) / sigma
    # Sigmoid maps (-inf, +inf) -> (0, 1)
    weight = torch.sigmoid(normalised)
    return lambda_min + (lambda_max - lambda_min) * weight

def compute_length_adaptive_gae(rewards, values, lengths,
                                 lambda_min=0.3, lambda_max=0.95, gamma=1.0):
    """
    rewards: [B, T], sparse, mostly 0 except the last real token
    values:  [B, T+1], value estimates including the bootstrap slot;
             values[b, lengths[b]] should be 0 for responses that terminated
    lengths: [B], actual sequence length per sample

    Returns: advantages [B, T] (positions past each length stay 0)
    """
    B, T = rewards.shape
    lambdas = compute_adaptive_lambda(lengths, lambda_min, lambda_max)  # [B]
    advantages = torch.zeros_like(rewards)

    for b in range(B):
        lam = lambdas[b].item()
        L = int(lengths[b].item())
        gae = 0.0
        # Backward recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        for t in reversed(range(L)):
            delta = rewards[b, t] + gamma * values[b, t+1] - values[b, t]
            gae = delta + gamma * lam * gae
            advantages[b, t] = gae
    return advantages
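To see what the sigmoid mapping does to a concrete batch, here is a dependency-free sketch of the same formula. It uses the population standard deviation of the batch lengths and a floor of 1.0, and the four example lengths are made up. It mirrors the review's description of the mapping rather than any released code.

```python
import math
import statistics

def adaptive_lambdas(lengths, lambda_min=0.3, lambda_max=0.95):
    """Per-response lambda from batch-standardised length passed through a sigmoid."""
    mu = statistics.mean(lengths)
    sigma = max(statistics.pstdev(lengths), 1.0)  # floor avoids division by ~0
    lambdas = []
    for n in lengths:
        weight = 1.0 / (1.0 + math.exp(-(n - mu) / sigma))
        lambdas.append(lambda_min + (lambda_max - lambda_min) * weight)
    return lambdas

lengths = [100, 500, 2000, 4000]   # hypothetical response lengths in one batch
for n, lam in zip(lengths, adaptive_lambdas(lengths)):
    print(f"length={n:5d}  lambda={lam:.3f}")
```

Longer responses get a larger $\lambda$ , a response of exactly average length gets the midpoint of $[\lambda_\text{min}, \lambda_\text{max}]$ , and no response can leave that interval. Because the statistics are per batch, the same response would receive a different $\lambda$ in a batch with a different length distribution.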

Step 3: Clip-Higher Loss

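import torch

def clip_higher_loss(log_probs, old_log_probs, advantages,
                     eps_low=0.2, eps_high=0.28):
    """
    Asymmetric clip: the upper bound (1 + eps_high) is looser than the lower
    bound (1 - eps_low), so tokens with positive advantage can be up-weighted
    further before the objective is clipped. Returns per-token loss [B, T].
    """
    ratio = torch.exp(log_probs - old_log_probs)
    # Unclipped surrogate objective
    unclipped = ratio * advantages
    # Surrogate with the ratio clipped to [1 - eps_low, 1 + eps_high]
    clipped_ratio = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    clipped = clipped_ratio * advantages
    return -torch.min(unclipped, clipped)  # negated because PPO maximises the objective

def token_level_loss(per_token_loss, lengths):
    """Average over all real tokens in the batch, not per sequence (assumes right padding)."""
    T = per_token_loss.shape[1]
    # mask[b, t] is True for real tokens and False for padding
    mask = torch.arange(T, device=per_token_loss.device)[None, :] < lengths[:, None]
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1)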
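The effect of the asymmetric bounds is easiest to see on a single token. The scalar sketch below compares a symmetric clip ( $\epsilon = 0.2$ on both sides) with the Clip-Higher bounds for a token whose probability ratio has risen to 1.25. The function name and the example ratios are illustrative.

```python
def clipped_objective(ratio, advantage, eps_low, eps_high):
    """Scalar PPO surrogate: min of the unclipped and clipped terms."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)

ratio, advantage = 1.25, 1.0

symmetric = clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.2)
clip_higher = clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28)

print("symmetric  :", symmetric)    # capped at 1 + 0.2
print("clip-higher:", clip_higher)  # 1.25 lies inside [0.8, 1.28], so not capped

# For a negative advantage both settings share the same lower bound.
print(clipped_objective(0.7, -1.0, 0.2, 0.2), clipped_objective(0.7, -1.0, 0.2, 0.28))
```

Under the symmetric clip the objective is flat at this ratio, so the token contributes no gradient. Under Clip-Higher it is still inside the trust region and keeps being reinforced.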

Step 4: Decoupled Update Schedule

# Schematic loop: collect_rollouts, compute_value_loss, policy.log_probs and
# sil_buffer stand in for the equivalents in your own codebase.
for step in range(total_rl_steps):
    # Collect G responses per prompt from the current policy
    rollouts = collect_rollouts(policy, prompts, G=8)

    # VALUE UPDATE: K_v steps on fresh rollouts
    for _ in range(K_v):
        value_loss = compute_value_loss(value_model, rollouts)
        value_optimiser.zero_grad()
        value_loss.backward()
        value_optimiser.step()

    # ADVANTAGE ESTIMATION: use the freshly updated value model.
    # Advantages are constants for the policy loss, so no gradients are tracked.
    with torch.no_grad():
        advantages = compute_length_adaptive_gae(
            rollouts.rewards, value_model(rollouts.states), rollouts.lengths
        )

    # POLICY UPDATE
    for _ in range(K_pi):
        policy_loss = token_level_loss(
            clip_higher_loss(policy.log_probs(rollouts), rollouts.old_log_probs, advantages),
            rollouts.lengths
        )
        policy_optimiser.zero_grad()
        policy_loss.backward()
        policy_optimiser.step()

    # Update SIL buffer with top-reward trajectories before sampling from it
    sil_buffer.update(rollouts)

    # SIL STEP every N_sil steps: raise the current policy's log-likelihood
    # of stored successful trajectories (skipped while the buffer is empty)
    if step % N_sil == 0 and len(sil_buffer) > 0:
        sil_batch = sil_buffer.sample()
        sil_loss = -policy.log_probs(sil_batch).mean()
        policy_optimiser.zero_grad()
        sil_loss.backward()
        policy_optimiser.step()
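The loop above leaves `sil_buffer` undefined, and the review does not state how trajectories are selected or aged out. The sketch below shows one possible policy, purely as an assumption: keep only trajectories with reward 1, in a fixed-capacity FIFO so the oldest successes are dropped first. The class name, the dict-based trajectory format and the capacity are all hypothetical.

```python
import random
from collections import deque

class SILBuffer:
    """Hypothetical FIFO buffer that stores only successful trajectories."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently evicts the oldest entry when full
        self.items = deque(maxlen=capacity)

    def update(self, trajectories):
        """Add trajectories whose terminal reward is 1; ignore failures."""
        for traj in trajectories:
            if traj["reward"] == 1.0:
                self.items.append(traj)

    def sample(self, batch_size, rng=random):
        """Uniformly sample up to batch_size stored trajectories."""
        k = min(batch_size, len(self.items))
        return rng.sample(list(self.items), k)

    def __len__(self):
        return len(self.items)

buffer = SILBuffer(capacity=2)
buffer.update([
    {"id": "a", "reward": 1.0},
    {"id": "b", "reward": 0.0},   # failure, never stored
    {"id": "c", "reward": 1.0},
    {"id": "d", "reward": 1.0},   # evicts "a", the oldest success
])
print([t["id"] for t in buffer.items])
```

A FIFO policy limits staleness at the cost of forgetting rare early successes. A priority-weighted or per-prompt buffer would trade these off differently, which is the kind of comparison the buffer-dynamics suggestion earlier in this review asks for.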

Contextualising VAPO in the 2025 RL-for-LLM Landscape

The timeline below places VAPO (April 2025) among the RL-for-LLM methods that surround it:

timeline
    title RL Training Methods for LLM Reasoning (2022-2025)
    2022 : InstructGPT - PPO+RLHF for alignment
    2023 : DPO - offline preference optimisation
    2024 : GRPO (DeepSeekMath) - value-model-free group baseline
    2025-Q1 : DeepSeek-R1-Zero - pure GRPO at scale
            : REINFORCE++ - stabilised REINFORCE for reasoning
            : VC-PPO - value pretraining + decoupled GAE
            : DAPO - clip-higher + token-loss on GRPO
            : Dr.GRPO - bias correction for GRPO
    2025-Q2 : VAPO - value-model-based with adaptive GAE
    2025-Q3 : GSPO - group sequence policy optimisation

Figure 7: Timeline of RL Training Innovations for LLM Reasoning — VAPO appears in Q2 2025 as the first value-model-based system to convincingly outperform the leading value-model-free methods at the 32B scale.

VAPO enters a crowded field but occupies a distinct niche: it is the first published evidence that value-model-based RL can win on hard mathematical reasoning at the 32B scale, reversing the impression that GRPO/DAPO had definitively settled the question in favour of value-model-free approaches. Whether this advantage holds at the frontier (100B+ scale, diverse task distributions, non-binary rewards) remains an open question.

Summary

VAPO makes a compelling case for reviving value-model-based RL for long-CoT reasoning. Its central technical insight — that a single fixed $\lambda$ in GAE is wrong for heterogeneous response lengths and that a sigmoid-adaptive mapping dramatically reduces this mismatch — is simple, well-motivated, and ablation-verified. The integrated system, combining Length-adaptive GAE with Value Pretraining, Decoupled-GAE, Clip-Higher, Token-level Loss, SIL, and Group-Sampling, achieves a convincing 10-point improvement over DAPO on AIME 2024 with better training stability and half the step count.

The main weaknesses are the narrow evaluation domain (single model, single benchmark family, binary rewards), the missing VC-PPO baseline, and the lack of wall-clock efficiency analysis. But as a systems paper demonstrating that “value-model-based > value-model-free if you do the engineering right,” VAPO is a significant contribution that should be on the reading list of anyone working on RL post-training for reasoning models.
