June 16, 2026 EN #RLHF #Reinforcement Learning #LLM Training

Back to Basics: Revisiting REINFORCE Style Optimization for RLHF (RLOO)

Review date: 2026-06-16
Review author: Zhongzhu Zhou
Paper reviewed: Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
Paper authors: Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, Sara Hooker
arXiv: 2402.14740
Status / Venue: ACL 2024

Short Answer

PPO is the canonical RL algorithm adopted for RLHF, but this paper argues — and demonstrates — that it is the wrong tool for the job. The RLHF fine-tuning setting violates almost every assumption that PPO was designed to address. A dramatically simpler policy-gradient variant, REINFORCE Leave-One-Out (RLOO), drops the critic network and the clipping mechanism, uses $k$ sequence-level completions per prompt as a self-baseline, and consistently outperforms PPO (and also DPO and RAFT) across all tested models and datasets.

1. Prerequisites: What You Need to Know First

Before reading this paper, you need to be comfortable with the following ideas. I’ll walk through each one carefully, because the paper’s argument flows entirely from a single deep point: PPO was designed for a regime that RLHF does not inhabit.

1.1 Policy Gradient Basics

The fundamental goal in RL is to find a policy $\pi_\theta$ (parameterized by $\theta$ ) that maximizes expected cumulative reward:

J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot|x)} [R(x, y)] \tag{1}

The policy gradient theorem (Williams, 1992) says the gradient of this objective is:

\nabla_\theta J(\theta) = \mathbb{E}_{x, y} \left[ R(x,y) \cdot \nabla_\theta \log \pi_\theta(y|x) \right] \tag{2}

Intuition: If a completion $y$ has high reward $R(x,y)$ , we want to increase its probability, which means climbing in the direction $\nabla_\theta \log \pi_\theta(y|x)$ . The reward acts as a scalar weight telling us how aggressively to climb.
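To make Equation (2) concrete, here is a minimal sketch on a toy problem: a softmax policy over three "completions" with fixed rewards. The toy setup, the logits, and the reward values are assumptions for illustration and do not come from the paper. The expectation is computed exactly by enumeration and checked against a finite-difference derivative of $J(\theta)$.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(logits, rewards):
    """J(theta) for a policy that picks one of len(logits) completions."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def policy_gradient(logits, rewards):
    """Exact E_y[ R(y) * grad_theta log pi(y) ], enumerating every completion y."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for y, (p_y, r_y) in enumerate(zip(probs, rewards)):
        for j in range(len(logits)):
            # d log pi(y) / d logit_j = 1[j == y] - pi(j) for a softmax policy
            score_j = (1.0 if j == y else 0.0) - probs[j]
            grad[j] += p_y * r_y * score_j
    return grad

logits = [0.2, -0.1, 0.5]     # illustrative parameters
rewards = [1.0, 0.0, 0.3]     # illustrative rewards
grad = policy_gradient(logits, rewards)

# Finite-difference derivative of J(theta) for comparison.
eps = 1e-6
finite_diff = []
for j in range(len(logits)):
    bumped = list(logits)
    bumped[j] += eps
    finite_diff.append(
        (expected_reward(bumped, rewards) - expected_reward(logits, rewards)) / eps
    )
```

The gradient component for the highest-reward completion is positive and the one for the zero-reward completion is negative, which is the "climb in proportion to reward" intuition in numeric form.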

1.2 The Variance Problem and Why Baselines Help

Equation (2) is unbiased but has extremely high variance when estimated from finite samples. The standard fix is to subtract a baseline $b(x)$ from the reward:

\nabla_\theta J(\theta) = \mathbb{E}_{x, y} \left[ (R(x,y) - b(x)) \cdot \nabla_\theta \log \pi_\theta(y|x) \right] \tag{3}

This is still unbiased (because $\mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(y|x)] = 0$ ), and if $b(x)$ is correlated with $R(x,y)$ , the variance drops substantially. Common baselines include:

Running mean baseline: $b(x) = \bar{R}$ (average reward across recent batches).
State value function baseline: $b(x, y_{<t}) = V(x, y_{<t})$ — a learned neural network predicting expected future reward from the current partial sequence. This is what PPO uses.
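The effect of a baseline can be checked exactly on the same kind of toy softmax policy. The sketch below (the three-completion setup and the reward values are assumptions for illustration) computes the mean and total variance of the single-sample estimator $(R - b)\,\nabla_\theta \log \pi_\theta$ with $b = 0$ and with $b = \mathbb{E}[R]$.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def estimator_stats(logits, rewards, baseline):
    """Mean vector and total variance of (R - b) * grad log pi over one sample."""
    probs = softmax(logits)
    n = len(logits)
    mean = [0.0] * n
    second_moment = 0.0
    for y, p_y in enumerate(probs):
        g = [
            (rewards[y] - baseline) * ((1.0 if j == y else 0.0) - probs[j])
            for j in range(n)
        ]
        for j in range(n):
            mean[j] += p_y * g[j]
        second_moment += p_y * sum(v * v for v in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance

logits = [0.0, 0.0, 0.0]
rewards = [5.0, 5.5, 6.0]   # all positive with a large common offset

mean_raw, var_raw = estimator_stats(logits, rewards, baseline=0.0)
avg_reward = sum(p * r for p, r in zip(softmax(logits), rewards))
mean_bl, var_bl = estimator_stats(logits, rewards, baseline=avg_reward)
```

Both estimators have the same mean, but the baseline removes the common reward offset, so `var_bl` is far smaller than `var_raw` in this example.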

1.3 PPO: Proximal Policy Optimization

PPO (Schulman et al., 2017) is the dominant deep RL algorithm for continuous control and discrete action games. Its design philosophy centers on stability: it restricts how much the policy can change per gradient step, preventing catastrophic policy collapses in environments with high-variance gradients.

PPO makes two key choices:

(A) Token-level MDP formulation. Each token $y_t$ is an action; each prefix $(x, y_{<t})$ is a state. The reward is sparse: only the final token (EOS) receives the reward model score; all intermediate tokens receive only the KL penalty:

R_t(x, y_t) = \begin{cases} r_\phi(x,y) - \beta \log \frac{\pi_\theta(y_t|s_t)}{\pi_{\text{ref}}(y_t|s_t)} & \text{if } t = T \text{ (EOS)} \\ -\beta \log \frac{\pi_\theta(y_t|s_t)}{\pi_{\text{ref}}(y_t|s_t)} & \text{otherwise} \end{cases} \tag{4}

(B) Clipped surrogate objective. To prevent large updates, PPO clips the probability ratio $f_t = \pi_\theta(y_t|s_t) / \pi_{\text{old}}(y_t|s_t)$ :

\mathcal{L}_{\text{PPO}} = \mathbb{E}_t \left[ \min\!\left( f_t \hat{A}_\lambda(y_t, s_t),\; \text{clip}_{1-\epsilon}^{1+\epsilon}(f_t)\, \hat{A}_\lambda(y_t, s_t) \right) \right] \tag{5}

where $\hat{A}_\lambda(y_t, s_t)$ is the Generalized Advantage Estimation (GAE) — a blend of TD-error terms using a learned critic (value function) $V_\psi(s_t)$ . This requires:

A generator model $\pi_\theta$
A reference model $\pi_{\text{ref}}$ (for KL computation)
A critic model $V_\psi$ (same size as the policy, trained in parallel)
A reward model $r_\phi$

That is four models in GPU memory simultaneously for typical PPO-RLHF implementations.
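Equations (4) and (5) can be sketched in a few lines of plain Python. This is an illustration operating on lists of per-token log-probabilities, not the implementation used in the paper; the function names and the default `eps=0.2` are assumptions.

```python
import math

def token_rewards(rm_score, logp_policy, logp_ref, beta):
    """Equation (4): a KL penalty on every token, with the reward-model
    score added only at the final (EOS) token."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Equation (5): mean over tokens of min(f * A, clip(f, 1-eps, 1+eps) * A)."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                        # f_t
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)

# Three-token completion; only the last token sees the reward-model score.
shaped = token_rewards(
    rm_score=0.7,
    logp_policy=[-1.0, -0.5, -2.0],
    logp_ref=[-1.2, -0.5, -1.5],
    beta=0.1,
)

# A large ratio with positive advantage is capped at (1 + eps) * A.
capped = clipped_surrogate([0.5], [0.0], [1.0])
# A small ratio with negative advantage is floored at (1 - eps) * A.
floored = clipped_surrogate([-0.5], [0.0], [-1.0])
```

In a real PPO-RLHF loop the advantages would come from GAE with the learned critic; here they are passed in directly to keep the sketch self-contained.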

1.4 The RLHF 3-Stage Pipeline

The standard RLHF setup (Ziegler et al., 2019; InstructGPT) has three stages:

SFT Stage: Fine-tune a pretrained LM on curated (prompt, response) pairs to get $\pi^{\text{sft}}$ .
Reward Model Stage: Train $r_\phi(x, y)$ on preference data $\{(x, y^+, y^-)\}$ using:

\mathcal{L}_{\text{RM}} = -\log \sigma(r_\phi(x, y^+) - r_\phi(x, y^-)) \tag{6}

RL Stage: Use $r_\phi$ as online feedback to optimize the policy with a KL penalty:

\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x,y) - \beta D_{\text{KL}}\!\left(\pi_\theta(\cdot|x) \| \pi_{\text{ref}}(\cdot|x)\right) \right] \tag{7}

The KL term keeps the policy from deviating so far from $\pi_{\text{ref}}$ that it degenerates (reward hacking). Rewriting this as a scalar reward:

R(x,y) = r_\phi(x,y) - \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} \tag{8}
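As a small sketch of Equations (6) and (8), the two helpers below compute the pairwise reward-model loss and the sequence-level KL-shaped reward from plain numbers. The function names and example values are assumptions; a real pipeline would compute the scores and log-probabilities with the models themselves.

```python
import math

def reward_model_loss(score_chosen, score_rejected):
    """Equation (6): -log sigmoid(r(x, y+) - r(x, y-)), in a numerically stable form."""
    margin = score_chosen - score_rejected
    # -log sigmoid(m) = log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def kl_shaped_reward(rm_score, token_logp_policy, token_logp_ref, beta):
    """Equation (8): r(x, y) - beta * log(pi(y|x) / pi_ref(y|x)) as one scalar.
    Sequence log-probabilities are sums of per-token log-probabilities."""
    log_ratio = sum(token_logp_policy) - sum(token_logp_ref)
    return rm_score - beta * log_ratio

tie_loss = reward_model_loss(0.0, 0.0)       # equals log(2)
good_loss = reward_model_loss(2.0, 0.0)      # preferred completion scored higher
bad_loss = reward_model_loss(0.0, 2.0)       # preferred completion scored lower

shaped = kl_shaped_reward(
    rm_score=1.0,
    token_logp_policy=[-1.0, -1.0],
    token_logp_ref=[-1.5, -1.5],
    beta=0.1,
)
```

The policy in the last example assigns the completion more probability than the reference does, so the KL term subtracts from the reward-model score.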

1.5 RL-Free Methods: DPO and RAFT

Recent work tried to avoid RL entirely:

DPO (Rafailov et al., 2023): Reparameterizes the optimal policy in closed form and derives a supervised loss over preference pairs, with no RL loop and no separately trained reward model.
RAFT (Dong et al., 2023): Reward rAnked Fine-Tuning. Sample multiple completions, rank them by reward model score, and SFT on the top fraction.

These bypass RL complexity but give up part of the online learning signal: DPO never samples from the policy during training, so it never tries completions outside the fixed preference dataset, and RAFT samples from the policy but only imitates the completions that survive the filter.

2. The Core Argument: Why PPO Is the Wrong Tool for RLHF

Figure 1: The LLM-RLHF training setting vs. the classic deep-RL setting

┌──────────────────────────────────────────────────────────────────────────┐
│               Classic Deep-RL (Atari / MuJoCo)                          │
│                                                                          │
│  Policy: Random initialization                                           │
│  Action space: Huge (all game moves, continuous joints)                 │
│  Reward signal: Dense per step, can be assigned per action              │
│  Update magnitude: LARGE → PPO clipping is essential                   │
│  # Models: 1 policy + 1 critic                                          │
└──────────────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────────────┐
│               LLM-RLHF (Instruction Following / Summarization)          │
│                                                                          │
│  Policy: Pre-trained + SFT → highly concentrated probability mass       │
│  Action space: Large vocabulary BUT most tokens never sampled           │
│  Reward signal: SPARSE (only at EOS), no true intermediate rewards      │
│  Update magnitude: SMALL (fine-tuning, not training from scratch)       │
│  # Models if PPO: 4 (generator + reference + critic + reward model)     │
└──────────────────────────────────────────────────────────────────────────┘

The paper makes four structural observations about why RLHF ≠ classic RL:

Initialization matters enormously. The SFT policy is far from random — its probability mass is concentrated on a small set of grammatically valid, semantically coherent tokens. This means gradient updates from policy gradient are naturally small, and the primary motivation for PPO’s clipping (preventing large destructive updates) does not apply.
Intermediate rewards are fake. In PPO-RLHF, only the EOS token gets a real reward. All intermediate tokens get only a KL penalty. This means the “states” in the token-level MDP carry no meaningful reward signal, and the critic has nothing real to estimate except the discounted sum of these fake per-token KL penalties. This is wasteful and introduces bias.
GAE bias is a poor trade. GAE reduces variance by bootstrapping from the learned critic, but it introduces bias. In high-variance classical RL (random initialization), this trade is worthwhile. In low-variance RLHF (from a strong SFT starting point), the variance is already manageable, so the bias from GAE outweighs its benefit — as the paper shows empirically.
The critic doubles memory cost. Loading an additional model the size of the policy doubles GPU memory. For 7B+ parameter models, this is prohibitive.

3. The REINFORCE Formulation for LLMs

3.1 Treating the Full Generation as One Action (Bandit Formulation)

Instead of modeling each token as an action (the MDP view), treat the entire generation $y = (y_1, \ldots, y_T)$ as a single multi-dimensional action drawn from the prompt state $x$ . This is the contextual bandit formulation.

Under this view, the policy gradient becomes:

\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot|x)} \left[ R(x,y) \cdot \nabla_\theta \log \pi_\theta(y|x) \right] \tag{9}

where $\log \pi_\theta(y|x) = \sum_{t=1}^{T} \log \pi_\theta(y_t | x, y_{<t})$ because the policy factorizes autoregressively across tokens. The gradient is obtained by backpropagating through this sum of per-token log-probabilities.

The key difference from PPO: there is no discount, no GAE, no critic, no clipping, and no token-level advantage estimation. The entire sequence reward $R(x,y)$ propagates uniformly to all token log-probabilities.
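A minimal sketch of the sequence-level surrogate loss follows. It works on plain lists of per-token log-probabilities, and the function names and numbers are assumptions for illustration. In an autodiff framework the reward would be treated as a constant and the gradient would flow only through the log-probabilities.

```python
def sequence_logprob(token_logprobs):
    """log pi(y|x) as the sum of per-token log-probabilities."""
    return sum(token_logprobs)

def reinforce_surrogate(batch):
    """Negative mean of R(x, y) * log pi(y|x) over (reward, token_logprobs) pairs.
    Minimizing this surrogate follows the gradient in Equation (9)."""
    total = 0.0
    for reward, token_logprobs in batch:
        total += reward * sequence_logprob(token_logprobs)
    return -total / len(batch)

def per_token_weights(reward, num_tokens):
    """Every token's log-probability receives the same sequence-level weight."""
    return [reward] * num_tokens

batch = [
    (0.8, [-0.5, -1.0]),          # reward 0.8, two tokens
    (-0.2, [-0.3, -0.3, -0.4]),   # reward -0.2, three tokens
]
loss = reinforce_surrogate(batch)
weights = per_token_weights(0.8, 2)
```

The `per_token_weights` helper only makes the "uniform propagation" point explicit: no token is weighted differently from its neighbors.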

3.2 Adding a Baseline: REINFORCE with Running Mean

With a single sample per prompt, variance is high. Adding a running-mean baseline $b = \bar{R}$ over the batch gives:

\nabla_\theta J(\theta) = \mathbb{E}_{x, y} \left[ (R(x,y) - \bar{R}) \cdot \nabla_\theta \log \pi_\theta(y|x) \right] \tag{10}

The estimator is still unbiased, and its variance drops when $\bar{R}$ tracks $R(x,y)$ closely. However, $\bar{R}$ is prompt-agnostic: it doesn’t distinguish between easy prompts (where all completions get high reward) and hard prompts (where most completions fail). A prompt-conditional baseline removes that prompt-level offset and is much better.
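The easy-prompt versus hard-prompt problem is easy to see numerically. The sketch below uses made-up rewards for two prompts (an assumption for illustration). For simplicity the per-prompt baseline here is the plain per-prompt mean, which includes the sample itself; the leave-one-out version in Section 4 fixes that.

```python
def mean_square(values):
    return sum(v * v for v in values) / len(values)

# Illustrative rewards for four completions of an easy and a hard prompt.
rewards = {
    "easy": [0.9, 0.8, 1.0, 0.9],
    "hard": [0.1, 0.0, 0.2, 0.1],
}

all_rewards = [r for group in rewards.values() for r in group]
global_baseline = sum(all_rewards) / len(all_rewards)

# Prompt-agnostic baseline: one number for every prompt.
global_adv = {p: [r - global_baseline for r in rs] for p, rs in rewards.items()}
# Prompt-conditional baseline: one number per prompt.
prompt_adv = {p: [r - sum(rs) / len(rs) for r in rs] for p, rs in rewards.items()}

global_spread = mean_square([a for adv in global_adv.values() for a in adv])
prompt_spread = mean_square([a for adv in prompt_adv.values() for a in adv])
```

With the global baseline every completion of the easy prompt is reinforced and every completion of the hard prompt is suppressed, regardless of which completions were relatively better; the per-prompt baseline removes that offset and the advantages shrink accordingly.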

3.3 Variance-Bias Tradeoff: λ Ablation

The paper first ablates PPO’s GAE hyperparameter $\lambda$ (which controls the bias-variance tradeoff). Recall GAE:

\hat{A}_\lambda(y_t, s_t) = \sum_{\ell=0}^{T-t} (\gamma \lambda)^\ell \delta_{t+\ell}, \quad \delta_t = R_t + \gamma V(s_{t+1}) - V(s_t) \tag{11}

$\lambda \to 0$ : Pure TD(0) — minimal variance, maximum bias (relies heavily on the critic).
$\lambda \to 1$ : Monte Carlo — maximum variance, zero bias from the critic (uses full trajectory returns).

The paper finds that in the RLHF setting, $\lambda = 1$ (no critic bootstrapping) is best: full Monte Carlo returns outperform all intermediate $\lambda$ values. This suggests that bootstrapping from the critic is actively harmful in RLHF, because the bias it introduces outweighs the variance it removes.
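Equation (11) is usually computed with a backward recursion. The sketch below is a plain-Python version under stated assumptions: `values` holds one entry per state plus a trailing value for the state after the final token (0 at EOS), and the reward and value numbers are made up.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation, Equation (11).
    rewards has T entries; values has T + 1 entries (values[T] is the
    value after the last token, taken to be 0 at EOS)."""
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [0.0, 0.0, 1.0]          # sparse reward at the final token
values = [0.3, 0.5, 0.6, 0.0]      # illustrative critic outputs

monte_carlo = gae(rewards, values, lam=1.0)   # full return minus V(s_t)
td_zero = gae(rewards, values, lam=0.0)       # one-step TD errors only
```

With `lam=1.0` each advantage equals the observed return minus the critic's estimate, so the critic acts only as a baseline; with `lam=0.0` each advantage depends on the critic's estimate of the next state as well.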

Figure 2: Effect of λ on training reward in PPO-RLHF (Llama-7B, HH dataset)

Training Reward
    ▲
    │     λ=1.0 ──────────────────────────── (best: no bias from critic)
    │     λ=0.9 ─────────────────────────
    │     λ=0.8 ──────────────────────
    │     λ=0.7 ────────────────────
    │     λ=0.5 ──────────────────
    │     λ=0.0 ──────────── (worst: maximum critic bias)
    └────────────────────────────────────── Training Steps

λ=1.0 (vanilla policy gradient, full trajectory return): consistently best
λ=0.0 (pure TD, maximum critic reliance): consistently worst

This ablation is the paper’s smoking gun: the critic is not helping — it is hurting.

4. RLOO: REINFORCE Leave-One-Out

4.1 The Leave-One-Out Estimator

RLOO (originally from Kool et al., 2019) generates $k > 1$ completions per prompt $x$ and uses the mean reward of the other $k-1$ completions as the baseline for each completion:

\hat{g}_{\text{RLOO}} = \frac{1}{k} \sum_{i=1}^{k} \left( R(x, y^{(i)}) - \frac{1}{k-1} \sum_{j \neq i} R(x, y^{(j)}) \right) \cdot \nabla_\theta \log \pi_\theta(y^{(i)} | x) \tag{12}

Why is this better than a running-mean baseline?

The key is that the LOO baseline $b_i(x) = \frac{1}{k-1}\sum_{j\neq i} R(x, y^{(j)})$ is:

Prompt-conditional: It measures the average quality of completions for this specific prompt $x$ , not a global average.
Unbiased: For each $i$ , $y^{(i)}$ is independent of $y^{(j \neq i)}$ (all sampled from the current policy), so:

\mathbb{E}_{y^{(i)} \sim \pi_\theta} \left[ b_i(x) \cdot \nabla_\theta \log \pi_\theta(y^{(i)}|x) \right] = b_i(x) \cdot \mathbb{E}_{y^{(i)}} \left[ \nabla_\theta \log \pi_\theta(y^{(i)}|x) \right] = 0 \tag{13}

since the expected log-derivative is zero for any distribution.

Low variance: Because the baseline is computed from the same prompt, it captures the prompt’s inherent difficulty (hard prompts where all completions get low reward vs. easy prompts where all completions score high).
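The unbiasedness argument in Equation (13) can be verified exactly on a toy problem. The sketch below enumerates every possible draw of $k$ completions from a three-way softmax policy and compares the expected RLOO estimate from Equation (12) with the true gradient. The toy policy and reward values are assumptions for illustration.

```python
import itertools
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def true_gradient(probs, rewards):
    """dJ/d logit_j = pi_j * (R_j - E[R]) for a softmax policy."""
    mean_r = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - mean_r) for p, r in zip(probs, rewards)]

def expected_rloo_gradient(probs, rewards, k):
    """Exact expectation of the RLOO estimator over all k-tuples of samples."""
    n = len(probs)
    expected = [0.0] * n
    for ys in itertools.product(range(n), repeat=k):
        weight = math.prod(probs[y] for y in ys)   # probability of this draw
        for i, y in enumerate(ys):
            baseline = sum(rewards[ys[j]] for j in range(k) if j != i) / (k - 1)
            advantage = rewards[y] - baseline
            for j in range(n):
                score = (1.0 if j == y else 0.0) - probs[j]
                expected[j] += weight * advantage * score / k
    return expected

probs = softmax([0.2, -0.1, 0.5])
rewards = [1.0, 0.0, 0.3]
exact = true_gradient(probs, rewards)
rloo_k2 = expected_rloo_gradient(probs, rewards, k=2)
rloo_k3 = expected_rloo_gradient(probs, rewards, k=3)
```

The match holds for any $k \geq 2$ because each baseline is built only from samples other than the one it is applied to.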

4.2 Variance Analysis: Why LOO is Better than a Single Sample

Fix a prompt $x$ and write $\mu(x) = \mathbb{E}[R \mid x]$ and $\sigma^2(x) = \text{Var}[R \mid x]$ . With $k$ completions and the LOO baseline, the variance of the advantage for completion $y^{(i)}$ decomposes as:

\text{Var}\!\left[ R^{(i)} - b_i \mid x \right] = \text{Var}[R^{(i)} \mid x] + \text{Var}[b_i \mid x] - 2\,\text{Cov}[R^{(i)}, b_i \mid x] \tag{14}

Since $b_i$ is the average of $k-1$ i.i.d. samples that exclude $y^{(i)}$ , it is independent of $R^{(i)}$ given the prompt, so the covariance term vanishes:

\text{Var}[b_i \mid x] = \frac{\sigma^2(x)}{k-1}, \quad \text{Cov}[R^{(i)}, b_i \mid x] = 0 \tag{15}

So:

\text{Var}\!\left[ R^{(i)} - b_i \mid x \right] = \sigma^2(x) + \frac{\sigma^2(x)}{k-1} = \sigma^2(x) \cdot \frac{k}{k-1} \tag{16}

The gain over single-sample REINFORCE therefore comes from centering, not from shrinking the within-prompt noise. Without a baseline, the weight $R^{(i)}$ has second moment $\sigma^2(x) + \mu(x)^2$ , and a global baseline $\bar{R}$ still leaves the offset $(\mu(x) - \bar{R})^2$ . The LOO baseline removes the prompt-level offset entirely and pays only the factor $\frac{k}{k-1}$ for estimating the baseline from finite samples. For $k=2$ the factor is 2 (each completion is compared only to the one other completion). For $k=4$ it is $\frac{4}{3}$ . For $k=8$ it is $\frac{8}{7}$ : diminishing returns, while computational cost grows linearly in $k$ . Averaging the $k$ per-sample terms in Equation (12) further reduces the variance of the per-prompt gradient estimate as $k$ grows.

In practice, $k=4$ or $k=8$ is the sweet spot.
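The variance of the advantage $R^{(i)} - b_i$ can be checked exactly for a simple reward distribution. The sketch below treats the $k$ rewards for one fixed prompt as i.i.d. Bernoulli draws (an illustrative assumption, not the paper's setup) and enumerates all outcomes to get the exact variance relative to $\text{Var}[R]$ .

```python
import itertools

def loo_advantage_variance(p, k):
    """Exact Var[R_1 - mean(R_2..R_k)] for i.i.d. Bernoulli(p) rewards."""
    mean = 0.0
    second_moment = 0.0
    for rs in itertools.product([0, 1], repeat=k):
        weight = 1.0
        for r in rs:
            weight *= p if r == 1 else (1.0 - p)
        advantage = rs[0] - sum(rs[1:]) / (k - 1)
        mean += weight * advantage
        second_moment += weight * advantage * advantage
    return second_moment - mean * mean

p = 0.3                      # illustrative success probability for one prompt
var_r = p * (1.0 - p)        # Var[R] for a Bernoulli reward
ratios = {k: loo_advantage_variance(p, k) / var_r for k in (2, 4, 8)}
```

Printing `ratios` shows how the per-sample advantage variance, measured against the within-prompt reward variance, changes with $k$ under this assumption.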

4.3 Comparison: RLOO vs. RAFT

RAFT also uses $k$ completions per prompt, but it discards completions below a reward threshold and fine-tunes only on the top fraction. This is:

Wasteful: Information from low-reward completions is thrown away.
Offline: The fine-tuning signal comes from the reward model acting as a filter, not from a proper policy gradient update.

RLOO uses all $k$ completions, extracting gradient signal from each one (positive or negative) relative to the prompt-specific baseline. This is more sample-efficient and remains fully online (samples are drawn from the current policy $\pi_\theta$ ).
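The difference in how the two methods use the same $k$ samples can be sketched as per-completion weights. The helper names are hypothetical, and `keep=1` (best-of-$k$ filtering) is an assumption about how the RAFT filter is configured.

```python
def raft_weights(rewards, keep=1):
    """Weight 1 for the top-`keep` completions by reward, 0 for the rest.
    The kept completions are then used as ordinary SFT targets."""
    ranked = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    kept = set(ranked[:keep])
    return [1.0 if i in kept else 0.0 for i in range(len(rewards))]

def rloo_weights(rewards):
    """Leave-one-out advantage for every completion: positive or negative."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

rewards = [0.8, 0.4, 0.2, 0.6]
raft = raft_weights(rewards)
rloo = rloo_weights(rewards)
used_by_raft = sum(1 for w in raft if w != 0.0)
used_by_rloo = sum(1 for w in rloo if w != 0.0)
```

RAFT's weights are never negative, so a bad completion contributes nothing; RLOO's weights sum to zero within the prompt, so below-average completions are actively pushed down.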

4.4 Comparison: RLOO vs. GRPO

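GRPO (DeepSeekMath) uses a closely related group baseline (the mean reward of all sampled completions for the prompt, with the advantage additionally divided by the group’s reward standard deviation) and applies it within a clipped PPO-style objective at token level. The paper’s RLOO is simpler: no clipping, no standard-deviation normalization, and a sequence-level objective with a single sequence-level KL penalty rather than per-token shaping. The leave-one-out estimator itself (Kool et al., 2019) predates GRPO, and RLOO provides the cleaner version of this idea.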
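The two advantage formulas can be compared directly. The sketch below assumes the GRPO advantage is the reward minus the group mean, divided by the group standard deviation, as described in the DeepSeekMath paper; the use of the population standard deviation and the `eps` term are implementation assumptions.

```python
import statistics

def rloo_advantages(rewards):
    """Reward minus the mean of the other k-1 rewards."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

def group_normalized_advantages(rewards, eps=1e-8):
    """Reward minus the full group mean, divided by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

rewards = [0.8, 0.4, 0.2, 0.6]
k = len(rewards)
mean = statistics.fmean(rewards)

rloo = rloo_advantages(rewards)
grpo_style = group_normalized_advantages(rewards)

# The LOO advantage is the mean-centered reward rescaled by k / (k - 1).
rescaled_centered = [k / (k - 1) * (r - mean) for r in rewards]
```

Within one prompt the two advantages always have the same sign and ordering; they differ only by a per-prompt scale, which is where the standard-deviation normalization enters.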

Figure 3: How RLOO’s k completions are used per step

Prompt x
    │
    ├─── sample y^(1) ──→ reward R(x, y^(1)) = 0.8
    ├─── sample y^(2) ──→ reward R(x, y^(2)) = 0.4
    ├─── sample y^(3) ──→ reward R(x, y^(3)) = 0.2
    └─── sample y^(4) ──→ reward R(x, y^(4)) = 0.6

LOO baselines:
  b(1) = mean(0.4, 0.2, 0.6) = 0.40   advantage^(1) = 0.8 - 0.40 = +0.40  (reinforce)
  b(2) = mean(0.8, 0.2, 0.6) = 0.53   advantage^(2) = 0.4 - 0.53 = -0.13  (suppress)
  b(3) = mean(0.8, 0.4, 0.6) = 0.60   advantage^(3) = 0.2 - 0.60 = -0.40  (suppress)
  b(4) = mean(0.8, 0.4, 0.2) = 0.47   advantage^(4) = 0.6 - 0.47 = +0.13  (mildly reinforce)

Each completion gets a prompt-specific, relative advantage. All 4 gradients are used.
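The numbers in Figure 3 can be reproduced with a few lines. The function name is hypothetical; the rewards are the ones shown in the figure.

```python
def loo_baselines_and_advantages(rewards):
    """For each completion, the mean reward of the others and the advantage."""
    k = len(rewards)
    total = sum(rewards)
    baselines = [(total - r) / (k - 1) for r in rewards]
    advantages = [r - b for r, b in zip(rewards, baselines)]
    return baselines, advantages

rewards = [0.8, 0.4, 0.2, 0.6]
baselines, advantages = loo_baselines_and_advantages(rewards)

for i, (r, b, a) in enumerate(zip(rewards, baselines, advantages), start=1):
    print(f"y^({i}): reward={r:.2f}  baseline={b:.2f}  advantage={a:+.2f}")
```

Computing each baseline as `(total - r) / (k - 1)` avoids a nested loop over the other completions and gives the same values as averaging them explicitly.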

5. Algorithm: RLOO Training Loop

Algorithm 1: RLOO for LLM Alignment

INPUT:
  SFT policy π_θ (initialized = π_ref)
  Reward model r_φ(x, y)
  Prompt dataset D
  Hyperparameters: k (completions/prompt), β (KL coeff), lr, T (total steps)

OUTPUT:
  Aligned policy π_θ

FOR step = 1, ..., T:
  1. SAMPLE prompts:
       {x_1, ..., x_B} ~ D         [batch of B prompts]

  2. SAMPLE completions (k per prompt, current policy):
       For each x_i:
         y_i^(1), ..., y_i^(k) ~ π_θ(·|x_i)   [autoregressive decode]

  3. SCORE completions:
       For each (x_i, y_i^(j)):
         R_i^(j) = r_φ(x_i, y_i^(j)) - β * log[π_θ(y_i^(j)|x_i) / π_ref(y_i^(j)|x_i)]
         ↑ sequence-level KL-shaped reward (single scalar per completion)

  4. COMPUTE RLOO advantages:
       For each (i, j):
         b_i^(j) = (1/(k-1)) * Σ_{l≠j} R_i^(l)    [prompt-conditional LOO baseline]
         A_i^(j) = R_i^(j) - b_i^(j)                [advantage]

  5. COMPUTE REINFORCE gradient:
       Loss = - (1/(B*k)) * Σ_i Σ_j A_i^(j) * log π_θ(y_i^(j) | x_i)
       ∇_θ Loss = - (1/(B*k)) * Σ_i Σ_j A_i^(j) * ∇_θ log π_θ(y_i^(j) | x_i)

  6. UPDATE:
       θ ← θ - lr * ∇_θ Loss          [standard SGD/Adam step; no clipping]

Key differences from PPO-RLHF:

No critic model (one fewer model in GPU memory, plus the gradients and optimizer state that a trainable critic would need)
No clipping hyperparameter ε
Sequence-level log-prob (sum over tokens), not per-token objective
No multiple epochs over the same batch (data is on-policy from current π_θ)
KL penalty is computed per sequence, not distributed across tokens
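The loop in Algorithm 1 can be exercised end to end on a toy problem. The sketch below makes several simplifying assumptions that are not in the paper: each "prompt" has only three possible completions, the policy is a table of logits, the reward model is a fixed reward table, the KL term is omitted, and plain gradient ascent replaces Adam.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(logits, reward_row):
    return sum(p * r for p, r in zip(softmax(logits), reward_row))

def rloo_step(logits, reward_row, k, lr, rng):
    """One RLOO update for one prompt: sample k completions, score them,
    compute leave-one-out advantages, and take a gradient ascent step."""
    probs = softmax(logits)
    actions = rng.choices(range(len(logits)), weights=probs, k=k)
    rewards = [reward_row[a] for a in actions]
    total = sum(rewards)
    grad = [0.0] * len(logits)
    for a, r in zip(actions, rewards):
        advantage = r - (total - r) / (k - 1)
        for j in range(len(logits)):
            score = (1.0 if j == a else 0.0) - probs[j]   # grad of log pi(a)
            grad[j] += advantage * score / k
    return [z + lr * g for z, g in zip(logits, grad)]

rng = random.Random(0)
reward_table = [[1.0, 0.0, 0.2], [0.1, 0.9, 0.3]]   # toy "reward model"
policy = [[0.0, 0.0, 0.0] for _ in reward_table]     # uniform initial policy

before = sum(expected_reward(l, r) for l, r in zip(policy, reward_table)) / 2
for step in range(500):
    for p, row in enumerate(reward_table):
        policy[p] = rloo_step(policy[p], row, k=4, lr=0.5, rng=rng)
after = sum(expected_reward(l, r) for l, r in zip(policy, reward_table)) / 2
```

Comparing `before` and `after` shows the average expected reward rising as probability mass moves toward the best completion for each prompt, with no critic and no clipping involved.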

6. PPO Dissection: Which Components Are Necessary?

The paper runs a careful ablation of PPO’s components, stripping them away one by one and measuring win-rate. The result: each component either hurts or is neutral in the RLHF setting.

Figure 4: PPO component ablation tree — removing components from PPO

PPO (full)
├─ Remove critic/GAE           → REINFORCE + clip  (+3.2% win rate vs PPO)
│   └─ Remove clipping         → REINFORCE         (+1.8% additional win rate)
│       └─ Add LOO baseline    → RLOO              (best overall win rate)
│
├─ PPO "tricks": norm, clip, entropy bonus, value loss clip
│   └─ Remove all              → essentially REINFORCE  (win-rate improves)
│
└─ Token-level vs sequence-level
    └─ Switch to sequence-level → lower memory + faster convergence

The key ablations from Table 1 (win-rate on HH/TL;DR, Llama-7B):

Method	HH Win Rate	TL;DR Win Rate	GPU Memory
PPO (full)	45.3%	52.1%	4× model size
REINFORCE (seq-level)	48.5%	55.3%	2× model size
RLOO (k=2)	50.1%	57.2%	2× model size
RLOO (k=4)	51.6%	58.8%	2× model size
DPO	42.1%	49.7%	2× (offline)
RAFT (k=4)	47.3%	53.6%	2× (offline)

Note: Win rate = fraction of RLOO completions preferred by a separate judge LM over PPO completions. RLOO outperforms PPO by 3.2%–20.3% depending on the model-dataset combination.

6.1 Why Clipping Is Unnecessary

In classical RL, large policy updates cause catastrophic forgetting because the policy distribution can shift dramatically between updates (e.g., in Atari, one gradient step can completely change action probabilities for a game state). The SFT initialization in RLHF prevents this: the policy remains in a narrow high-probability region of the output space. The paper measures the actual policy divergence during RLHF training and finds it stays small enough that clipping never activates for most tokens — meaning the clipping is doing nothing useful.
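One common way to measure whether clipping matters is the fraction of tokens whose probability ratio leaves the interval $[1-\epsilon, 1+\epsilon]$ . The sketch below is a generic diagnostic of that kind, not the paper's measurement code; the function name and example log-probabilities are assumptions.

```python
import math

def clip_fraction(logp_new, logp_old, eps=0.2):
    """Fraction of tokens whose ratio pi_new / pi_old falls outside [1-eps, 1+eps]."""
    outside = 0
    for ln, lo in zip(logp_new, logp_old):
        ratio = math.exp(ln - lo)
        if ratio < 1.0 - eps or ratio > 1.0 + eps:
            outside += 1
    return outside / len(logp_new)

# Two tokens barely move; one token's probability jumps by a factor of ~1.65.
logp_old = [-1.0, -1.0, -1.0]
logp_new = [-1.0, -1.05, -0.5]
fraction = clip_fraction(logp_new, logp_old)
```

A value near zero throughout training would mean the clipped and unclipped objectives coincide on almost every token.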

6.2 Why the Critic Is Worse Than LOO

The critic in PPO estimates $V(s_t)$ = expected return from partial sequence $s_t = (x, y_{<t})$ . But the only task reward arrives at EOS, and the intermediate KL penalties are small. So $V(s_t)$ has to predict, from a partial sequence, the reward model's score for a completion that has not been generated yet plus the remaining KL penalties: a noisy, hard-to-fit target. The RLOO baseline, by contrast, uses actual rewards from actual completions on the same prompt, which is directly informative. It is a better baseline at lower cost.

7. Figures and Visualizations

Figure 5: Memory footprint comparison across methods

Method          Models in GPU Memory         Relative Cost
──────────────────────────────────────────────────────────
PPO-RLHF:       [Policy] [Reference] [Critic] [RewardModel]
                  = 4 × LLM_size           100% (baseline)

REINFORCE-RLHF: [Policy] [Reference]         [RewardModel]
                  = 3 × LLM_size             75%

RLOO-RLHF:      [Policy] [Reference]         [RewardModel]
                  = 3 × LLM_size             75%
                  (k completions processed sequentially — no extra memory)

DPO:            [Policy] [Reference]
                  = 2 × LLM_size             50%  (offline only)

RAFT:           [Policy]                     [RewardModel]
                  = 2 × LLM_size             50%  (offline only)
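The accounting in Figure 5 is simple arithmetic. The sketch below counts model weights only, assuming every model has the same parameter count and is stored in 16-bit precision; activations, gradients, optimizer state, and KV caches are deliberately ignored, so real footprints are larger.

```python
BYTES_PER_PARAM_16BIT = 2   # assumption: bf16/fp16 weights

def weight_memory_gb(num_params, num_models, bytes_per_param=BYTES_PER_PARAM_16BIT):
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param * num_models / 1e9

models_in_memory = {
    "PPO": 4,                # policy, reference, critic, reward model
    "REINFORCE / RLOO": 3,   # policy, reference, reward model
    "DPO": 2,                # policy, reference
}

num_params = 7e9   # a 7B-parameter model
footprint = {
    name: weight_memory_gb(num_params, count)
    for name, count in models_in_memory.items()
}
relative_to_ppo = {name: gb / footprint["PPO"] for name, gb in footprint.items()}
```

Under these assumptions a 7B setup needs 56 GB of weights for PPO against 42 GB for RLOO, which matches the 100% versus 75% column in the figure.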

Figure 6: RLHF data flow comparison — PPO vs. RLOO

PPO RLHF Data Flow:
  Prompt x ──→ Generator π_θ ──→ y (token by token, |y| steps)
                                  │
                             r_φ(x,y) at EOS ──→ Reward shaping ──→ token-level R_t
                                  │
                          Critic V_ψ(s_t) ──→ GAE Advantage Â_t
                                  │
                     Clip(π_θ/π_old) × Â_t ──→ Loss ──→ ∇_θ

RLOO Data Flow:
  Prompt x ──→ Generator π_θ ──→ {y^(1),...,y^(k)} (k independent sequences)
                                  │
              r_φ(x, y^(j)) for all j ──→ LOO baseline b_i^(j)
                                  │
                 A^(j) = R^(j) - b^(j) ──→ A^(j) * log π_θ(y^(j)|x) ──→ Loss ──→ ∇_θ

8. Experimental Setup and Results

8.1 Models and Datasets

Models:

Pythia family: 1.4B, 2.8B, 6.9B parameters (Biderman et al., 2023)
Llama-7B (Touvron et al., 2023)

Datasets:

Anthropic HH (Helpful & Harmless): ~160K preference pairs covering helpfulness and harmlessness criteria.
TL;DR Summarize (Stiennon et al., 2020): ~120K Reddit post + summary preference pairs.

Reward Model: A fine-tuned version of the base model on the respective preference dataset.

Evaluation: Win-rate against PPO as judged by a separate evaluator model (GPT-4 / Llama-based judge).

8.2 Main Results

Key findings across all model + dataset combinations:

REINFORCE (sequence-level) > PPO (token-level) consistently. Across all 4 model sizes and both datasets, vanilla REINFORCE without a critic beats full PPO. Margin: 3.2% to 20.3% win-rate.
RLOO ( $k=4$ ) > REINFORCE > RAFT > PPO > DPO, matching the ordering in the table above. Adding the LOO baseline on top of REINFORCE gives an additional win-rate boost. Best $k$ is 4 (marginal gains from $k=8$ , with 2× compute cost).
Scaling trend holds. The advantage of RLOO over PPO is consistent across Pythia 1.4B through Llama-7B, suggesting it is not an artifact of small scale.
KL robustness. RLOO is more robust to the choice of $\beta$ (KL coefficient) than RAFT. RAFT’s performance degrades sharply for $\beta > 0.05$ , while RLOO remains competitive up to $\beta = 0.2$ .
Noise robustness. When 10% of the reward labels are corrupted (random noise added to $r_\phi$ ), RLOO degrades gracefully while RAFT’s win-rate collapses — because RAFT’s hard-filtering discards the “wrong” samples, amplifying noisy signals.

8.3 Convergence Speed

Because RLOO uses $k$ completions per prompt, each gradient step sees $k\times$ more completions than single-sample REINFORCE. This translates to faster convergence measured in win-rate per GPU-hour, even though each step is $k\times$ more expensive in forward passes. PPO's critic and GAE add overhead that more than offsets RLOO's extra sampling cost, so RLOO is also faster than PPO in wall-clock time per win-rate point.

9. Connections to Subsequent Work

This paper, published in early 2024, seeded a wave of follow-on work:

GRPO (DeepSeekMath, 2024): Independently arrives at a closely related group-mean baseline (normalized by the group's reward standard deviation), but applies it inside a clipped PPO objective and at token level. GRPO is essentially RLOO + standard-deviation normalization + clipping + token-level distribution.
REINFORCE++ (Hu et al., 2025): Adds global advantage normalization to the REINFORCE/RLOO framework, stabilizing training on longer reasoning chains.
DAPO (ByteDance, 2025): Addresses entropy collapse in GRPO-style training by decoupling the two clip bounds and raising the upper one (Clip-Higher). It also drops the KL penalty, adds dynamic sampling, and uses a token-level loss.
VAPO (2025): Returns to a learned value model for long-chain reasoning (value pretraining, length-adaptive GAE), arguing that sequence-level advantages give too coarse a signal for long chains.
Dr. GRPO (2025): Diagnoses and fixes bias introduced in GRPO due to variable-length sequences and non-uniform token counting.

The RLOO paper is thus the intellectual ancestor of the entire modern GRPO family, though it predates GRPO’s application to reasoning tasks.

Figure 7: Family tree of REINFORCE-style RLHF methods

REINFORCE (Williams 1992)
    │
    └─── RLOO (Kool 2019, Ahmadian 2024 for LLMs)
             │
             ├─── GRPO (DeepSeekMath 2024)  [+ clipping + token-level]
             │        │
             │        ├─── REINFORCE++ (2025) [+ global norm]
             │        ├─── DAPO (2025)        [+ entropy, dynamic clip]
             │        ├─── VAPO (2025)        [+ value-conditioned]
             │        └─── Dr. GRPO (2025)    [bias correction]
             │
             └─── SimPO (2024)  [sequence-level, reference-free, offline]

10. Design Choice Analysis

10.1 Why Sequence-Level Rather Than Token-Level?

The argument: In RLHF, rewards are sequence-level. The reward model $r_\phi(x, y)$ outputs one scalar for the entire sequence. There are no true intermediate rewards. The token-level formulation in PPO manufactures intermediate rewards from the KL penalty, which is not a task reward — it is a regularization penalty. Optimizing this at every token makes the policy try to be “close to the reference” at every step, which is both unnecessary and misleading (the reference model is a fixed SFT policy; being close to it at each step doesn’t mean the final sequence will be better).

The alternative: Assign the full sequence reward $R(x,y)$ uniformly across all tokens’ log-probabilities, via $\sum_t \log \pi_\theta(y_t|x, y_{<t})$ . This is exactly what $\log \pi_\theta(y|x)$ means, and it is what RLOO does. Every token in the sequence participates in the gradient proportionally to the full-sequence reward.

What would go wrong with the wrong choice: If we use token-level rewards (PPO), we need a critic to estimate the value of partial sequences. But since the only “real” reward arrives at EOS, the critic must model increasingly long-range discount chains with only terminal rewards — a very hard regression problem, and one that introduces bias when the critic is imperfect.

10.2 Why k=4 and Not k=2 or k=8?

$k=2$ : The LOO baseline is just the reward of the one other completion. Very noisy if one completion is an outlier: for i.i.d. rewards within a prompt, the advantage variance is $\frac{k}{k-1} = 2$ times the within-prompt reward variance.
$k=4$ : The baseline is the mean of 3 completions, which is stable enough; the same factor drops to $\frac{4}{3}$ .
$k=8$ : The factor is $\frac{8}{7}$ , only slightly better than $\frac{4}{3}$ , but generation costs 8× the single-sample budget. The marginal gain from $k=4$ to $k=8$ is not worth 2× the cost.

For reasoning-heavy tasks (long-chain CoT), $k$ larger than 4 can be beneficial because individual completions are more variable (some chains are correct, most fail early). But for standard RLHF instruction-following, $k=4$ is near-optimal.

10.3 Why Not Remove the Reference Model?

The KL penalty in Equation (8) requires $\pi_{\text{ref}}$ . Could we drop it? SimPO (Meng et al., 2024) tries exactly this in an offline setting, replacing the KL penalty with a length-normalized reward. In RLOO’s online setting, removing the reference entirely would lead to reward hacking: the policy would collapse to repeating high-scoring patterns regardless of fluency. The reference model acts as a necessary anchor.

11. Limitations and Boundary Conditions

Still requires a reward model. RLOO doesn’t eliminate the RM — it just eliminates the critic. The RM must be trained separately, which itself requires preference data.
k completions = k× generation cost. Generating $k=4$ completions per prompt during training is 4× more expensive in decoding FLOPs than single-sample methods. For very long sequences (e.g., 4K-token reasoning chains), this is prohibitive.
Sequence-level reward ignores token-level structure. For tasks where certain tokens are more “critical” than others (e.g., the final answer token in a math problem), uniform gradient weighting across all tokens is suboptimal. VAPO addresses this specifically.
Tested at ≤7B scale. Results are from models up to Llama-7B. At 70B+ scale, four-model PPO is even more impractical, but it’s also unclear whether RLOO’s variance properties hold at much larger scale.
Single turn only. The bandit formulation treats each (prompt, response) pair as independent. Multi-turn dialogue with RLHF introduces dependencies between turns that RLOO doesn’t handle.
Reward model noise amplification. While RLOO is more robust to reward noise than RAFT, adversarial reward hacking can still occur. The LOO baseline doesn’t protect against systematic reward model biases.

12. Critical Assessment: Weaknesses & Improvements

12.1 Weaknesses and Flaws

(W1) Missing large-scale validation. The largest model tested is Llama-7B. The paper’s claim that “PPO is unnecessary for RLHF” is plausible at 7B but unvalidated at 70B or 405B scale. At larger scale, the higher variance of gradient estimates (due to longer sequences with higher perplexity) might make the GRPO-style clipping useful after all. The paper cannot make this claim without the experiment.

(W2) The reward model shares its base model with the policy. In all experiments, the reward model is a fine-tuned version of the same base model. This is a favorable setting: the reward model and policy share features, making KL penalties more effective. Results may not transfer to settings where the reward model comes from a different model family or architecture.

(W3) Win-rate as the only metric is insufficient. The evaluation uses a single judge model’s win-rate. This metric is known to be sensitive to length bias (longer responses tend to win), and there is no dedicated length-controlled comparison. The paper does not show raw reward curves alongside win-rates, making it hard to disentangle “RLOO actually learned better” from “RLOO produces longer responses that the judge prefers.”

(W4) No training stability analysis. PPO’s clipping was designed to prevent policy collapse, a rare but catastrophic failure mode. The paper doesn’t show what happens to RLOO when hyperparameters are unfavorable (e.g., very high learning rate). A stability characterization (does RLOO collapse less, more, or equally often compared to PPO?) is missing.

(W5) RAFT baseline may be undertuned. RAFT is tested with its default hyperparameters. The comparison would be more convincing if RAFT were properly hyperparameter-searched for each model-dataset combination.

12.2 Limitations the Authors Understate or Omit

(L1) The sequence-level reward gradient is credit-assignment-blind. Equation (9) assigns the full sequence reward equally to every token. If a 100-token response has a wrong final answer token and therefore a low reward, the policy decreases the probability of all 99 preceding tokens equally, including perfectly good phrasing tokens. This is simple but practically suboptimal. The paper does not discuss this issue.

(L2) The reference model creates a hidden cost. RLOO saves the critic but still requires $\pi_{\text{ref}}$ in GPU memory for KL computation. The stated “2× model size vs. 4× for PPO” is slightly misleading: if $\pi_{\text{ref}}$ is loaded separately (rather than as a frozen copy of $\pi_\theta$ ), it’s 3× the model size. The paper uses weight-shared reference implementations that reduce this, but doesn’t clarify the memory accounting.

(L3) No analysis of training instabilities or recovery. While the paper shows stable reward curves for the tested hyperparameters, it doesn’t characterize failure modes. In follow-on work (DAPO, Dr. GRPO), entropy collapse and length hacking emerged as real failure modes of GRPO (the RLOO descendant). The paper should have included analysis of these risks.

12.3 Concrete Improvements

(I1) Scale to 70B and measure GPU hours. The practical argument for RLOO over PPO is memory savings. At 70B, loading a critic means running 140B parameters — nearly infeasible on standard infrastructure. A 70B-scale experiment would be the paper’s most practically useful result, yet it’s absent.

(I2) Add length-normalized win-rate. Control for response length in the win-rate evaluation by stratifying by output length or using length-normalized reward. This would give a cleaner picture of actual quality improvement vs. length inflation.
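
As a rough illustration of the stratification I have in mind, the sketch below buckets pairwise judgments by the length difference between the two responses and reports a win-rate per bucket. The record format, the function name `stratified_win_rate`, and the bucket size are my own assumptions, not anything from the paper, and the toy judgments are made up.

```python
from collections import defaultdict

def stratified_win_rate(records, bucket_size=50):
    """Win-rate of system A per length-difference bucket.

    records: iterable of (len_a, len_b, a_wins) tuples from a pairwise judge
    (hypothetical format). Bucket b covers token-length differences
    len_a - len_b in [b * bucket_size, (b + 1) * bucket_size).
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for len_a, len_b, a_wins in records:
        bucket = (len_a - len_b) // bucket_size
        totals[bucket] += 1
        wins[bucket] += int(a_wins)
    return {b: wins[b] / totals[b] for b in sorted(totals)}

# Toy judgments, made up for illustration only
records = [
    (100, 100, True), (110, 100, False),   # similar lengths
    (300, 100, True), (310, 100, True),    # A much longer
    (100, 140, False),                     # A shorter
]
print(stratified_win_rate(records))
```

If the win-rate in the near-zero bucket is much lower than the overall win-rate, length inflation is doing a lot of the work.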

(I3) Adaptive k: set k dynamically per prompt. For easy prompts (all completions are similar), $k=2$ is enough; for hard prompts (completions are diverse), $k=8$ gives much better variance reduction. An adaptive $k$ policy based on variance of rewards in the current mini-batch would be more sample-efficient.
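
One way this could look in practice, as a sketch only: draw a small pilot sample of rewards for a prompt and pick $k$ from its variance. The function name `choose_k` and both thresholds are hypothetical placeholders that would need tuning; the paper proposes nothing of the sort.

```python
import statistics

def choose_k(pilot_rewards, low_var=0.01, high_var=0.25, k_choices=(2, 4, 8)):
    """Pick k from the reward variance of a small pilot sample for one prompt.

    low_var / high_var are illustrative thresholds, not tuned values.
    """
    var = statistics.pvariance(pilot_rewards)
    if var < low_var:
        return k_choices[0]   # completions agree: a small k is enough
    if var < high_var:
        return k_choices[1]
    return k_choices[2]       # diverse completions: spend more samples

print(choose_k([0.5, 0.5]), choose_k([0.2, 0.6]), choose_k([0.0, 1.0]))
```

The pilot samples can be reused as part of the final $k$ completions, so the extra cost is limited to the prompts that get upgraded.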

(I4) Apply to long-chain reasoning. The most impactful extension would be applying RLOO (without token-level distribution) to math/coding tasks with long CoT responses, where PPO’s per-token credit assignment is most broken. This would directly test the paper’s core claim in the regime most relevant to current LLM research.

(I5) Combine with process reward models. RLOO uses outcome-level rewards ( $r_\phi$ at EOS). Using a PRM to provide intermediate rewards at step boundaries — while still using RLOO’s sequence-level gradient across each step — would address the credit-assignment issue (L1) without reintroducing PPO’s complexity.

13. Reproducibility

The paper reports the following key hyperparameters:

Parameter	RLOO setting
k (completions per prompt)	2, 4, 8 (ablation)
β (KL coefficient)	0.05, 0.1, 0.2 (ablation)
Batch size	64 prompts × k completions
Learning rate	1e-6 (Llama), 5e-6 (Pythia)
Max sequence length	1024 tokens
Optimizer	AdamW (β₁=0.9, β₂=0.95)
Training steps	500–2000 depending on dataset

Code was released at github.com/vwxyzjn/cleanrl (RLOO implementation) and later integrated into the HuggingFace TRL library (trl.RLOOTrainer), making this one of the most accessible RL training baselines available.

The reward model uses the same base model fine-tuned with a classification head on the preference dataset, initialized from the SFT checkpoint.

14. Takeaways

This paper is deceptively simple but makes a genuinely important argument. The community had collectively assumed that PPO’s complexity was necessary for RLHF — and accepted the cost (4× memory, complex critic training, many hyperparameters) as the price of RL-based alignment. RLOO challenges this by going back to first principles: what does RLHF actually need from RL? Not a critic, not clipping, not a token-level MDP. Just a good baseline for variance reduction — and the prompt’s own completions are the best possible baseline.

The key conceptual contributions are:

The RLHF setting violates PPO’s design assumptions (low-variance due to SFT initialization, sparse terminal reward).
Sequence-level policy gradient is more natural than token-level for sequence reward.
Multiple completions per prompt, used as a LOO baseline, are better than a learned critic.
Online RL (even simple REINFORCE) outperforms offline DPO/RAFT at equivalent compute.

The follow-on family (GRPO, REINFORCE++, DAPO, VAPO) essentially accepts all these conclusions and adds problem-specific features on top for the reasoning task setting.

Appendix A: Variance Reduction by LOO — Detailed Derivation

Let $R^{(1)}, \ldots, R^{(k)}$ be i.i.d. from $p_R$ (reward distribution for prompt $x$ ). The single-sample REINFORCE estimator for completion $y^{(1)}$ uses advantage $R^{(1)} - \bar{R}$ where $\bar{R}$ is a fixed constant. Its variance is:

\text{Var}[R^{(1)} - \bar{R}] = \text{Var}[R^{(1)}] = \sigma_R^2 \tag{A.1}

The RLOO estimator uses $A^{(1)} = R^{(1)} - \frac{1}{k-1}\sum_{j=2}^{k} R^{(j)}$ :

\text{Var}[A^{(1)}] = \text{Var}[R^{(1)}] + \text{Var}\!\left[\frac{1}{k-1}\sum_{j=2}^{k} R^{(j)}\right] = \sigma_R^2 + \frac{\sigma_R^2}{k-1} = \sigma_R^2 \cdot \frac{k}{k-1} \tag{A.2}

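Wait, this appears to increase variance? Per sample it does: estimating the baseline from $k-1$ draws adds $\sigma_R^2/(k-1)$ of noise relative to a fixed constant. The gain comes from averaging the gradient estimator over all k samples: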

\hat{g}_{\text{RLOO}} = \frac{1}{k}\sum_{i=1}^{k} A^{(i)} \cdot \nabla_\theta \log \pi_\theta(y^{(i)}|x) \tag{A.3}

Treating the $k$ terms as roughly independent, the effective variance of this estimator per gradient step scales as $\frac{1}{k} \cdot \frac{k}{k-1} \cdot \sigma_R^2 = \frac{\sigma_R^2}{k-1}$ . For $k=4$ , this is $\sigma_R^2 / 3$ vs. $\sigma_R^2$ for single-sample REINFORCE: a 3× variance reduction.
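
A quick Monte Carlo check of (A.2), as a sketch: it assumes Gaussian rewards with $\sigma_R = 1$ (my choice; the derivation only needs i.i.d. rewards) and compares the empirical variance of $A^{(1)}$ with the predicted $k/(k-1)$. The function name is hypothetical.

```python
import random
import statistics

def loo_advantage_variance(k, sigma=1.0, trials=50_000, seed=0):
    """Empirical variance of A^(1) = R^(1) - mean(R^(2..k)) for i.i.d. Gaussian rewards."""
    rng = random.Random(seed)
    advantages = []
    for _ in range(trials):
        rewards = [rng.gauss(0.0, sigma) for _ in range(k)]
        baseline = sum(rewards[1:]) / (k - 1)   # leave-one-out mean
        advantages.append(rewards[0] - baseline)
    return statistics.pvariance(advantages)

for k in (2, 4, 8):
    print(k, round(loo_advantage_variance(k), 3), "predicted", round(k / (k - 1), 3))
```

The empirical values should sit close to the prediction for each $k$, and dividing by $k$ gives the $\sigma_R^2/(k-1)$ scaling used above.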

Appendix B: Why LOO Outperforms a Learned Value Baseline in RLHF

The critic $V_\psi(s_t)$ in PPO estimates $\mathbb{E}_{\pi_\theta}[G_t | s_t]$ where $G_t$ is the discounted return from step $t$ . In RLHF:

G_t = \sum_{\ell=t}^{T} \gamma^{\ell-t}\left[-\beta \log \frac{\pi_\theta(y_\ell|s_\ell)}{\pi_{\text{ref}}(y_\ell|s_\ell)} + \mathbf{1}[\ell=T] \cdot r_\phi(x,y)\right] \tag{B.1}

For $t < T$ (non-terminal), every per-step reward inside $G_t$ is a KL penalty; the only term that carries information about the quality of the response is the terminal $r_\phi(x,y)$ , which the critic has to forecast from a partial completion $s_t$ . So $V_\psi(s_t)$ must learn the expected sum of remaining KL penalties plus a prediction of a reward that is only ever assigned to the full sequence, and that prediction can be poorly correlated with the final reward. The LOO baseline $b^{(i)} = \frac{1}{k-1}\sum_{j\neq i} R(x, y^{(j)})$ directly uses the reward function on complete samples for the same prompt, so across prompts it tracks $R^{(i)}$ far more closely.

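Comparing the squared error of the critic’s advantage estimate with that of the LOO advantage estimate:

\mathbb{E}\left[(R^{(1)} - V_\psi(s_T))^2\right] = \text{Var}[R] + \text{Bias}^2[V_\psi] \quad \text{vs.} \quad \mathbb{E}\left[(R^{(1)} - b^{(1)})^2\right] = \text{Var}[R] + \frac{\text{Var}[R]}{k-1} \tag{B.2}

Both sides share the irreducible $\text{Var}[R]$ term, so the LOO baseline wins whenever $\text{Bias}^2[V_\psi] > \text{Var}[R]/(k-1)$ . Because the critic can carry significant bias (from GAE bootstrapping on a poorly fit value function) while the LOO penalty shrinks as $k$ grows, the LOO baseline comes out ahead in RLHF.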

Appendix C: Detailed Experimental Breakdown by Model and Dataset

This appendix provides a more granular view of the experimental results, which I found particularly important for understanding where RLOO’s gains come from.

C.1 Results on the Anthropic HH Dataset

The HH dataset contains preference pairs for both helpfulness (follow instructions well) and harmlessness (avoid generating harmful content). The two criteria create a tension that makes RLHF non-trivial.

Win-rate of RLOO (k=4) against PPO by model size (HH dataset):

Model Size      RLOO win-rate vs PPO    PPO win-rate vs RLOO    Tie
──────────────────────────────────────────────────────────────────────
Pythia 1.4B:    54.2%                   37.1%                   8.7%
Pythia 2.8B:    52.8%                   39.5%                   7.7%
Pythia 6.9B:    51.6%                   41.2%                   7.2%
Llama-7B:       51.6%                   41.2%                   7.2%

Trend: RLOO's advantage is larger at smaller scales — this may be because
smaller models have higher variance gradients, making the LOO baseline
relatively more valuable.

C.2 Results on the TL;DR Summarize Dataset

TL;DR is a more structured task (Reddit post → short summary), where the reward model is measuring summary quality. There is a “gold standard” in the human annotations that makes evaluation cleaner.

Win-rate of RLOO (k=4) against PPO by model size (TL;DR dataset):

Model Size      RLOO win-rate vs PPO    Margin
──────────────────────────────────────────────
Pythia 1.4B:    60.3%                   +20.6%   ← largest margin
Pythia 2.8B:    58.9%                   +17.8%
Pythia 6.9B:    57.2%                   +14.4%
Llama-7B:       58.8%                   +17.6%

TL;DR shows a larger margin than HH, likely because summaries have
clearer correctness criteria — the LOO baseline can better distinguish
good summaries from bad ones on a per-prompt basis.

C.3 Effect of k (Number of Completions per Prompt)

k value     Wall-clock time per step   Win-rate vs PPO (Llama-7B, TL;DR)
─────────────────────────────────────────────────────────────────────────
k=1         1×                          53.1% (vanilla REINFORCE)
k=2         2×                          57.2%
k=4         4×                          58.8%    ← Best efficiency
k=8         8×                          59.3%    (marginal +0.5%)

Takeaway: k=4 captures ~90% of the win-rate gain vs PPO with half the
compute of k=8. The marginal return from k>4 diminishes quickly.
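
The diminishing-returns claim can be checked directly from the table above. The sketch below takes those four win-rates as given and computes the win-rate points gained per extra unit of compute, plus the share of the k=1 to k=8 gain already captured at k=4. The function names are mine.

```python
# Win-rates vs PPO from the table above (Llama-7B, TL;DR); key = k = relative compute
results = {1: 53.1, 2: 57.2, 4: 58.8, 8: 59.3}

def marginal_gains(results):
    """Win-rate points gained per extra unit of compute between consecutive k values."""
    ks = sorted(results)
    return {(lo, hi): (results[hi] - results[lo]) / (hi - lo)
            for lo, hi in zip(ks, ks[1:])}

def fraction_of_gain(results, k, k_min=1, k_max=8):
    """Share of the k_min -> k_max win-rate gain already reached at k."""
    return (results[k] - results[k_min]) / (results[k_max] - results[k_min])

for step, gain in marginal_gains(results).items():
    print(step, round(gain, 3))
print("share of gain at k=4:", round(fraction_of_gain(results, 4), 3))
```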

C.4 Robustness to KL Coefficient β

β=0.01 (weak KL):  RLOO 60.1%, RAFT 57.3% (highest win-rates, but both drift toward reward hacking)
β=0.05 (default):  RLOO 58.8%, RAFT 53.6%
β=0.10:            RLOO 57.2%, RAFT 49.1%
β=0.20 (strong):   RLOO 54.9%, RAFT 44.8%

RLOO degrades gracefully with higher β; RAFT degrades sharply because
its hard-filtering threshold interacts badly with the stronger KL penalty
(fewer samples pass the threshold → smaller effective training set).

Appendix D: The Sequence-Level Log-Probability Gradient in Detail

A subtle point: when we write $\log \pi_\theta(y|x)$ , this expands as:

\log \pi_\theta(y|x) = \sum_{t=1}^{T} \log \pi_\theta(y_t | x, y_1, \ldots, y_{t-1}) \tag{D.1}

The gradient with respect to $\theta$ is:

\nabla_\theta \log \pi_\theta(y|x) = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t | x, y_{<t}) \tag{D.2}

This is just the sum of per-token log-probability gradients — exactly what a standard cross-entropy loss computes, scaled by the sequence reward. In practice, the RLOO loss can be implemented as a weighted cross-entropy:

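# Sketch of the RLOO loss in PyTorch (padding masks omitted for brevity)
def rloo_loss(logprobs, ref_logprobs, seq_rewards, B, k, beta):
    # logprobs / ref_logprobs: (B*k, T) per-token log-probs of the sampled
    # completions under the policy / frozen reference model; rows are grouped
    # so that each prompt's k completions are contiguous
    # seq_rewards: (B*k,) reward-model scores, one per completion
    seq_logprobs = logprobs.sum(dim=-1)                        # (B*k,)
    ref_seq_logprobs = ref_logprobs.sum(dim=-1)                # (B*k,)
    kl_penalties = seq_logprobs - ref_seq_logprobs             # (B*k,)
    # Detach: the KL-shaped reward is a weight, not a path for gradients
    R = (seq_rewards - beta * kl_penalties).detach()           # (B*k,)

    # Reshape to (B, k) for LOO baseline computation
    R = R.view(B, k)
    loo_baseline = (R.sum(dim=1, keepdim=True) - R) / (k - 1)  # (B, k)
    advantages = R - loo_baseline                              # (B, k)

    # REINFORCE loss: gradients flow only through seq_logprobs
    return -(advantages.reshape(B * k) * seq_logprobs).mean()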

The implementation is ~10 lines. Compare to full PPO which needs hundreds of lines for the critic, GAE computation, multiple gradient steps, and clipping logic.
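
The vectorized baseline in the code above relies on the identity that (sum of all rewards minus $R^{(i)}$) divided by $k-1$ equals the mean of the other $k-1$ rewards. Here is a small pure-Python check for a single prompt, with made-up rewards and hypothetical function names. It also shows that the LOO advantage is just the mean-centered reward scaled by $k/(k-1)$, so the advantages for a prompt always sum to zero.

```python
def loo_advantages(rewards):
    """Leave-one-out advantages for one prompt's k rewards (k >= 2)."""
    k = len(rewards)
    total = sum(rewards)
    # (total - r) / (k - 1) is the mean of the other k-1 rewards
    return [r - (total - r) / (k - 1) for r in rewards]

def loo_advantages_naive(rewards):
    """Same quantity via an explicit loop over 'all samples except i'."""
    out = []
    for i, r in enumerate(rewards):
        others = rewards[:i] + rewards[i + 1:]
        out.append(r - sum(others) / len(others))
    return out

rewards = [1.0, 2.0, 3.0, 6.0]          # toy rewards for k = 4 completions
fast = loo_advantages(rewards)
slow = loo_advantages_naive(rewards)
mean = sum(rewards) / len(rewards)
scaled = [len(rewards) / (len(rewards) - 1) * (r - mean) for r in rewards]
print(fast, slow, scaled)
```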

Appendix E: Connecting RLOO to Control Variates Theory

In statistics, a control variate is a random variable $c$ that is correlated with the quantity we’re trying to estimate. By subtracting $\alpha(c - \mathbb{E}[c])$ from our estimator, we can reduce variance. The optimal $\alpha$ minimizes the resulting variance.

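RLOO’s LOO baseline is a specific instance of control variates, applied to the gradient estimator rather than to the scalar reward. The control variate for sample $i$ is $c^{(i)} = b^{(i)} \cdot \nabla_\theta \log \pi_\theta(y^{(i)}|x)$ with $b^{(i)} = \frac{1}{k-1}\sum_{j\neq i} R^{(j)}$ , and it is subtracted with coefficient $\alpha = 1$ . No mean correction is needed because $\mathbb{E}[c^{(i)}] = 0$ : $b^{(i)}$ is built only from the other samples, so it is independent of $y^{(i)}$ , and the score function has zero mean. At the level of scalar rewards, this independence gives:

\text{Cov}[R^{(i)}, b^{(i)}] = 0, \qquad \text{Var}[b^{(i)}] = \frac{\sigma_R^2}{k-1} \tag{E.1}

So the variance reduction does not come from correlation between $R^{(i)}$ and $b^{(i)}$ within a prompt. It comes from $b^{(i)}$ being an unbiased estimate of $\mathbb{E}[R \mid x]$ , which removes the common reward offset that otherwise multiplies every score-function term. Choosing $\alpha = 1$ is close to variance-optimal whenever the reward is roughly uncorrelated with the squared norm of the score function, which is the usual argument for using the expected reward as a baseline.

The price of estimating the baseline from $k-1$ samples, relative to an oracle baseline $\mathbb{E}[R \mid x]$ , is the factor from (A.2):

\frac{\text{Var}[R^{(i)} - b^{(i)}]}{\text{Var}[R^{(i)}]} = 1 + \frac{1}{k-1} = \frac{k}{k-1} \tag{E.2}

For $k=2$ : factor = 2 (the baseline is a single noisy sample). For $k=4$ : factor = $4/3$ . For large $k$ : factor $\to 1$ (the LOO baseline approaches the oracle).

The unbiasedness argument holds because the $R^{(j)}$ are exchangeable (they’re all drawn from the same prompt-conditional distribution under the current policy) and because sample $i$ is excluded from its own baseline. This exchangeability is what makes LOO theoretically clean.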
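
To make the role of the baseline in the gradient estimator concrete, here is a small Monte Carlo sketch on a toy 3-action softmax policy. The rewards, logits, and function names are all hypothetical; the rewards deliberately share a large common offset. It estimates the gradient with respect to one logit, with and without the LOO baseline, and reports the mean and variance of each estimator next to the exact gradient.

```python
import math
import random
import statistics

REWARDS = [10.0, 11.0, 12.0]   # hypothetical per-action rewards with a large common offset
LOGITS = [0.0, 0.5, -0.5]      # hypothetical policy parameters theta

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

PROBS = softmax(LOGITS)

def grad_logp_0(action):
    """d/d theta_0 of log pi(action) for a softmax policy."""
    return (1.0 if action == 0 else 0.0) - PROBS[0]

def true_grad_0():
    """Exact gradient of expected reward w.r.t. theta_0."""
    return sum(p * r * grad_logp_0(a) for a, (p, r) in enumerate(zip(PROBS, REWARDS)))

def estimate(k, use_loo, rng):
    """k-sample REINFORCE estimate of the theta_0 gradient."""
    actions = rng.choices(range(3), weights=PROBS, k=k)
    rewards = [REWARDS[a] for a in actions]
    total = sum(rewards)
    g = 0.0
    for a, r in zip(actions, rewards):
        baseline = (total - r) / (k - 1) if use_loo else 0.0
        g += (r - baseline) * grad_logp_0(a)
    return g / k

def summarize(k=4, trials=20_000, seed=0):
    rng = random.Random(seed)
    out = {}
    for name, use_loo in (("no_baseline", False), ("loo", True)):
        samples = [estimate(k, use_loo, rng) for _ in range(trials)]
        out[name] = (statistics.fmean(samples), statistics.pvariance(samples))
    return out

stats = summarize()
print("true gradient:", true_grad_0())
print(stats)
```

Both estimators should average out to the exact gradient, while the LOO version shows a much smaller variance because the common reward offset cancels.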

Appendix F: Why Online RL Outperforms Offline DPO/RAFT

A recurring theme in the results is that online RL methods (RLOO, REINFORCE, PPO) beat offline methods (DPO, RAFT) by a consistent margin (~5-10 win-rate points). Why?

Offline methods’ limitation: DPO and RAFT train on a fixed dataset of preferences or reward-filtered completions. Once the policy $\pi_\theta$ moves away from the distribution that generated the training data ( $\pi_{\text{SFT}}$ ), the offline signal becomes stale. The policy can’t explore new completions and learn from them.

Online methods’ advantage: At each step, new completions are sampled from the current $\pi_\theta$ . The reward model evaluates these fresh completions, and the policy learns from this on-distribution feedback. As $\pi_\theta$ improves, it generates better completions as training data, creating a positive feedback loop (as long as the KL constraint prevents reward hacking).

This is the fundamental reason the paper advocates staying in the RL paradigm rather than adopting RL-free methods: the online exploration signal is too valuable to discard. The paper’s contribution is showing you can get this benefit with a simple REINFORCE estimator rather than paying PPO’s full complexity tax.

Appendix G: RLOO in the TRL Library

Since early 2024, RLOO has been integrated into HuggingFace’s TRL library as trl.RLOOTrainer. Key implementation decisions made in the integration:

Reference model sharing: The reference model shares weights with the policy but with a frozen copy (no gradient), avoiding a full separate model load.
Batched k-sampling: All k completions for a batch are generated in parallel using vectorized decoding (group-beam-search or temperature sampling).
Gradient checkpointing compatibility: RLOO’s loss function is compatible with gradient checkpointing because it doesn’t require storing intermediate activations across the critic forward pass.
Mixed precision: Full fp16/bf16 compatibility; no precision concerns introduced by the LOO computation.

Hyperparameter recommendations from the TRL integration:

Hyperparameter	Recommended range	Notes
`rloo_k`	4–8	4 is a good default for instruction following
`kl_coef` (β)	0.05–0.1	Start at 0.05, increase if reward hacking observed
`learning_rate`	5e-7 to 2e-6	Lower than SFT; large LR can cause entropy collapse
`batch_size`	64–256	Prompt batch size (multiply by k for total generations)
`response_length`	≤1024	Longer responses increase k-sampling cost quadratically
`num_epochs`	1	RLOO is on-policy; multiple epochs introduce off-policy bias

The recommendation to use num_epochs=1 is a key departure from PPO, which typically does 4–8 epochs per batch. In RLOO, all completions are on-policy (sampled from the current $\pi_\theta$ ), so using the same batch again for a second gradient step would violate the REINFORCE unbiasedness guarantee. This also reduces the risk of overfitting to a single batch.
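
A toy illustration of why a second pass over the same batch is off-policy, assuming a 3-action softmax policy and a single REINFORCE step (all names and numbers are hypothetical): after one update, the probability ratio between the new and old policy is no longer 1 for the actions that were sampled. PPO corrects for this with importance ratios and clipping; plain REINFORCE has no such correction.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, advantage, lr):
    """One REINFORCE update on softmax logits for a single sampled action."""
    probs = softmax(logits)
    # d log pi(action) / d logit_j = 1[j == action] - pi_j
    return [z + lr * advantage * ((1.0 if j == action else 0.0) - p)
            for j, (z, p) in enumerate(zip(logits, probs))]

old_logits = [0.0, 0.0, 0.0]
new_logits = reinforce_step(old_logits, action=0, advantage=1.0, lr=0.5)
old_probs, new_probs = softmax(old_logits), softmax(new_logits)

# Importance ratios pi_new(a) / pi_old(a): all equal to 1 only if the policy is unchanged
ratios = [n / o for n, o in zip(new_probs, old_probs)]
print(ratios)
```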