REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

Review date: 2026-06-02
Review author: Zhongzhu Zhou
Paper reviewed: REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization
Paper authors: Jian Hu, Jason Klein Liu, Haotian Xu, Wei Shen et al.
arXiv: 2501.03262 (v9, Nov 2025)
Venue: Preprint — widely adopted in OpenRLHF, TRL, Seed1.5-Thinking, DAPO, ScaleRL

Short Answer

REINFORCE++ identifies a fundamental flaw in GRPO-family critic-free RLHF algorithms: their per-prompt (local) advantage normalization is a theoretically biased estimator because the numerator (centered reward) and the denominator (local group standard deviation) are statistically dependent. The fix is elegantly simple — normalize advantages across the entire global training batch instead of per-prompt groups. This single change produces an effectively unbiased estimator whose bias vanishes as batch size grows, dramatically stabilizes KL divergence, prevents overfitting on small prompt sets, and outperforms both GRPO and full-critic PPO on complex agentic benchmarks while eliminating the expensive critic network entirely.

Prerequisites

Before diving into the technical contributions, let me build the conceptual scaffolding you need to understand what REINFORCE++ is solving.

What is RLHF and Why Do We Need It?

Language models trained by next-token prediction (pre-training) learn to model the distribution of internet text. That distribution includes harmful, incorrect, and unhelpful content alongside good content. Supervised fine-tuning (SFT) on curated instruction-following data helps, but it is limited by the quality and coverage of labeled examples.

Reinforcement Learning from Human Feedback (RLHF) instead trains a reward model $R(x, y)$, a neural network that scores how good a response $y$ is to prompt $x$, and then uses reinforcement learning to update the language model policy $\pi_\theta$ to produce responses with high reward. The policy is rewarded for writing responses that the reward model judges as helpful, harmless, and honest.

RLHF has three key phases:

  1. Pre-training → Large language model $\pi_0$
  2. Reward model training → $R(x, y)$ trained on human preference pairs $(y_w, y_l)$ where $y_w$ is preferred over $y_l$, using the Bradley-Terry model: $P(y_w \succ y_l) = \sigma(R(x, y_w) - R(x, y_l))$
  3. RL fine-tuning → Update $\pi_\theta$ to maximize $E[R(x, y)]$ subject to a KL divergence penalty that keeps the policy from straying too far from $\pi_0$
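As a minimal sketch of the Bradley-Terry model in phase 2, the snippet below turns two scalar reward-model scores into a preference probability and the matching negative log-likelihood. The function names and the example scores are hypothetical, and the log-likelihood loss is the usual choice rather than something this review specifies.

```python
import math

def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry: P(y_w preferred over y_l) = sigmoid(R(x, y_w) - R(x, y_l))."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed human preference."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))

# Hypothetical scores: the chosen response is rated one point higher.
p = preference_probability(1.5, 0.5)
loss = preference_loss(1.5, 0.5)
```

Only the difference between the two scores matters, so adding a constant to every reward leaves the preference probability unchanged.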

The RL objective is:

$$\max_{\pi_\theta} \; E_{x \sim D,\, y \sim \pi_\theta(y|x)} \left[R(x, y) - \beta \cdot \text{KL}\big(\pi_\theta(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\big)\right]$$

The KL term prevents the policy from “reward hacking” — finding weird outputs that trick the reward model into giving high scores while departing far from coherent language.

The Policy Gradient Theorem

To understand advantage-based methods, we need the Policy Gradient Theorem (Williams, 1992; Sutton & Barto, 2018).

For a policy $\pi_\theta$ parameterized by $\theta$, the gradient of expected reward with respect to $\theta$ is:

$$\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(o_t | q, o_{<t}) \cdot G_t \right] \tag{PG}$$

where $G_t = \sum_{t'=t}^{T} r_{t'}$ is the cumulative return from time $t$ onward (in LLM RLHF, the reward is only at $t=T$, so $G_t = r_T$ for all $t$).

The key identity used here is:

$$\nabla_\theta \pi_\theta(a|s) = \pi_\theta(a|s) \cdot \nabla_\theta \log \pi_\theta(a|s)$$

This “log-derivative trick” converts the expectation into something estimable from samples — a fundamental building block of REINFORCE-family algorithms.

Variance reduction via baselines: A constant baseline $b$ can be subtracted from $G_t$ without introducing bias (because $E_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) \cdot b] = 0$):

$$\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(o_t | q, o_{<t}) \cdot (G_t - b) \right]$$

The quantity $(G_t - b)$ is the advantage $A_t$: how much better is the actual return than the baseline?
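The claim that a constant baseline leaves the gradient unbiased can be checked exactly on a toy problem. The sketch below assumes a three-action softmax policy with made-up logits and returns (none of this is from the paper) and sums over all actions instead of sampling, so the comparison carries no Monte Carlo noise.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(probs, action):
    """Gradient of log pi(action) with respect to the logits of a softmax policy."""
    return [(1.0 if j == action else 0.0) - p for j, p in enumerate(probs)]

def expected_gradient(logits, returns, baseline):
    """Exact E_a[ grad log pi(a) * (G(a) - b) ], summed over every action."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, p_a in enumerate(probs):
        g = grad_log_prob(probs, a)
        for j in range(len(logits)):
            grad[j] += p_a * g[j] * (returns[a] - baseline)
    return grad

logits = [0.2, -1.0, 0.7]      # hypothetical policy parameters
returns = [1.0, 0.0, 3.0]      # hypothetical return for each action
no_baseline = expected_gradient(logits, returns, baseline=0.0)
with_baseline = expected_gradient(logits, returns, baseline=2.5)
```

The two gradient vectors agree up to floating-point error; what the baseline changes is the variance of a sampled estimate, not its mean.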

Proximal Policy Optimization (PPO)

PPO is the dominant RL algorithm for RLHF. It uses an Actor-Critic architecture: a policy network $\pi_\theta$ (the “actor”) and a value/critic network $V_\phi$ that estimates expected future returns from any state.

The PPO surrogate objective is:

$$\mathcal{L}^{\text{PPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\; o \sim \pi_{\theta_{\text{old}}}}\left[\frac{1}{|o|}\sum_{t=1}^{|o|} \min\left(s_t(\theta) A_t,\; \text{clip}(s_t(\theta),\; 1-\epsilon,\; 1+\epsilon) A_t\right)\right] \tag{1}$$

where $s_t(\theta) = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})}$ is the probability ratio.

The clip function limits updates: if the ratio $s_t$ is too far from 1 (policy changed too much), the objective is clipped to prevent overly large updates. This is the key PPO stabilization mechanism.
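The per-token term inside Eq. 1 is small enough to write out directly. This is an illustrative sketch with a hypothetical function name; a real implementation would operate on tensors of log-probabilities.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """min(s * A, clip(s, 1 - eps, 1 + eps) * A) for a single token."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the gain from raising the ratio is capped at 1 + eps.
capped_gain = clipped_surrogate(ratio=1.5, advantage=1.0)
# Negative advantage: pushing the ratio below 1 - eps earns no further credit.
capped_penalty = clipped_surrogate(ratio=0.5, advantage=-1.0)
# The pessimistic side is never clipped: a bad action that became more likely
# is charged the full ratio.
full_penalty = clipped_surrogate(ratio=1.5, advantage=-1.0)
```

Because the outer `min` always keeps the smaller of the two terms, clipping only ever removes incentive to move further; it never hides a loss.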

The advantage $A_t$ is computed via Generalized Advantage Estimation (GAE) using the critic network $V_\phi$:

$$A_{q,o_t}^{\text{GAE}} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l} \tag{2}$$

where $\delta_t = r_t + \gamma V(o_{t+1}) - V(o_t)$ is the temporal difference error at token $t$, $\gamma \in (0,1]$ is the discount factor, and $\lambda \in [0,1]$ is the bias-variance tradeoff parameter.

Intuition for GAE: $\lambda = 0$ gives a one-step TD estimate (low variance, high bias); $\lambda = 1$ gives the full Monte Carlo return (no bias, high variance); intermediate $\lambda$ balances the two.
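Eq. 2 is usually evaluated with a backward recursion, $A_t = \delta_t + \gamma\lambda A_{t+1}$. The sketch below assumes a finite episode whose value after the last token is zero; the reward and value numbers are invented for illustration.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one finite trajectory.

    rewards[t] and values[t] are given for t = 0..T-1; the value after the
    final step is taken to be 0 (assumption: the episode terminates there).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD error
        running = delta + gamma * lam * running               # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages

rewards = [0.0, 0.0, 1.0]   # reward only on the last token
values = [0.5, 0.6, 0.8]    # hypothetical critic outputs

one_step_td = gae(rewards, values, gamma=1.0, lam=0.0)   # each entry is just delta_t
monte_carlo = gae(rewards, values, gamma=1.0, lam=1.0)   # each entry is return minus V(t)
```

With $\lambda = 1$ and $\gamma = 1$ the critic only enters as the subtracted baseline $V(o_t)$, which is why a critic that outputs zero everywhere would simply return the final reward at every token.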

The catch: Training and running the critic network $V_\phi$ requires:

  • A separate model with roughly the same parameter count as the policy (doubling memory)
  • Additional forward passes for value estimation during rollout
  • Additional backward passes for critic updates
  • Significant engineering complexity for actor-critic synchronization in distributed training

At the 7B-70B scale typical of modern LLM RLHF, this overhead is substantial.

Critic-Free Methods: REINFORCE, ReMax, RLOO, GRPO

To avoid the critic’s overhead, a family of critic-free methods replaces GAE with direct reward-based advantage estimates.

REINFORCE (Williams, 1992): The vanilla policy gradient uses the total return minus a fixed baseline $b$:

$$A_t^{\text{REINFORCE}} = R(o, q) - b$$

where $b$ is typically the mean reward across the batch. Simple and unbiased (as long as $b$ does not depend on the sampled response), but high variance, because a single constant baseline is the only variance-reduction device.

ReMax (Li et al., 2023): Generates one greedy-decoded response $\hat{o}$ per prompt and uses its reward as the baseline:

$$A_{q,o_t} = R(o) - R(\hat{o})$$

Intuition: Compare each sampled response to the greedy “best guess” response. If the sampled response is better than greedy, reinforce it; if worse, suppress it. Requires an extra greedy generation per prompt.

RLOO (Ahmadian et al., 2024): Samples $k$ responses per prompt, using the mean of the others as a leave-one-out baseline for each:

$$A_{q,o_t^{(i)}} = R(o^{(i)}) - \frac{1}{k-1}\sum_{j \neq i} R(o^{(j)})$$

Intuition: Leave-one-out estimator. For $k=2$, this is: “Is this response better than the other response to the same prompt?” Requires $k \geq 2$ responses per prompt.

GRPO (Shao et al., 2024): Used in DeepSeekMath and DeepSeek-R1. Samples $k$ responses per prompt and normalizes using group statistics:

$$A_{q,o_t^{(i)}} = \frac{R(o^{(i)}) - \text{mean}\{R(o^{(j)})\}_{j=1}^{k}}{\text{std}\{R(o^{(j)})\}_{j=1}^{k} + \epsilon} \tag{3}$$

GRPO adds z-score normalization on top of group-mean subtraction (unlike RLOO, the group mean includes the response being scored). This looks like it should simply rescale advantages to zero mean and unit variance within each group. But as REINFORCE++ proves, this normalization is biased.
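The RLOO and GRPO formulas can be compared side by side on one group of rewards. This is a sketch with hypothetical function names; it uses the population standard deviation (divide by $k$) as an assumption, and implementations may use the sample standard deviation instead.

```python
import statistics

def rloo_advantages(rewards):
    """Leave-one-out: each reward minus the mean of the other k-1 rewards."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

def grpo_advantages(rewards, eps=1e-8):
    """Eq. 3: z-score within the group (population std assumed here)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0]   # k = 4 responses to one prompt, 0/1 rewards
rloo = rloo_advantages(rewards)
grpo = grpo_advantages(rewards)
```

Both methods center the rewards around a per-prompt statistic; only GRPO additionally divides by a per-prompt spread, and that division is the step the bias argument targets.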

The Key Quantity: Advantage Estimation Quality

The quality of advantage estimation determines training stability and final policy performance:

  • Unbiased → Policy gradient points in the right direction on average
  • Low variance → Each gradient step is reliable; training converges faster
  • Global baseline → Policy improves against an absolute standard; prevents prompt-level overfitting
  • Numerical stability → Prevents gradient explosion when std→0

REINFORCE++ addresses all four properties simultaneously through global batch normalization.

What the Paper Does

REINFORCE++ (arXiv 2501.03262, v9 Nov 2025) makes three contributions:

  1. Theoretical proof (Appendix A) that GRPO’s per-prompt local normalization is a biased advantage estimator for any finite group size $k$.
  2. REINFORCE++ algorithm: replaces local normalization with global batch normalization; supports both k=1 (general RLHF, maximum efficiency) and k>1 (reasoning tasks, same global normalization principle).
  3. REINFORCE++w/Baseline algorithm: for complex tasks benefiting from group sampling ($k > 1$), adds a group-mean subtraction (reward reshaping) before global normalization, plus the theoretically sound $k_2$ KL estimator.

The paper provides empirical evidence across four domains: general RLHF (Chat-Arena-Hard), mathematical reasoning (AIME-24/25 overfitting experiment), OOD generalization (K&K logic puzzles), and complex tool-use agents (ZeroTIR multi-step math).

Method Deep Dive

The Three Flaws of Local Normalization

Figure 1: Architecture comparison — PPO vs. critic-free RLHF methods

flowchart LR
  subgraph PPO["PPO (Actor-Critic)"]
    A1[Policy Model\nπ_θ] -->|"sample o"| B1[Reward Model\nR]
    A1 --> C1["Critic Network V_φ\n(same size as actor)\n⚠️ 2× memory + compute"]
    B1 --> D1[GAE Advantage\nA_t from Eq.2]
    C1 --> D1
    D1 --> A1
  end

  subgraph GRPO["GRPO / RLOO\n(Local Norm)"]
    A2[Policy Model\nπ_θ] -->|"sample k responses\nper prompt"| B2[Reward Model\nR]
    B2 -->|"r_1...r_k\n(same prompt!)"| C2["Local Norm Eq.3\nA_i = (r_i - mean_group)\n         / std_group\n⚠️ Biased estimator!"]
    C2 --> A2
  end

  subgraph RF["REINFORCE++\n(Global Norm)"]
    A3[Policy Model\nπ_θ] -->|"sample 1+ responses\nper prompt"| B3[Reward Model\nR]
    B3 -->|"r for ALL prompts\nin the batch"| C3["Global Norm Eq.5\nA_i = (r_i - mean_batch)\n         / std_batch\n✅ Effectively unbiased"]
    C3 --> A3
  end

GRPO’s Equation (3) has three distinct problems:

Flaw 1 (Theoretical Bias): The advantage estimator $A_i = (\epsilon_i - \bar{\epsilon}) / D$ is biased because the numerator (centered reward) and the denominator $D$ (local group std) are not statistically independent: both are computed from the same $k$ rewards of one small group. I prove this in detail below.

Flaw 2 (Numerical Instability): When all $k$ responses to the same prompt receive similar rewards (common on easy prompts), the local $\text{std} \to 0$, and dividing by it rescales negligible, often noise-level, reward differences into full-magnitude advantages. This is a direct training instability trigger.

Flaw 3 (Prompt-Level Overfitting): The policy is optimized to be “better than other responses to the same prompt” rather than “globally better.” This encourages memorizing training prompts rather than learning transferable skills, which leads to catastrophic overfitting in low-data regimes.
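Flaw 2 is easy to reproduce with two made-up groups of rewards. In this sketch (hypothetical numbers, population std assumed), per-prompt z-scores give a 0.001 reward gap on an easy prompt a larger advantage than a 0.8 gap on a hard prompt, while pooled statistics keep the easy prompt's responses nearly indistinguishable.

```python
import statistics

def zscores(values, mean, std, eps=1e-8):
    return [(v - mean) / (std + eps) for v in values]

easy_prompt = [0.900, 0.900, 0.900, 0.901]   # nearly identical rewards
hard_prompt = [0.100, 0.900, 0.200, 0.800]   # genuinely different rewards

# Local (per-prompt) normalization, as in Eq. 3.
local_easy = zscores(easy_prompt, statistics.fmean(easy_prompt), statistics.pstdev(easy_prompt))
local_hard = zscores(hard_prompt, statistics.fmean(hard_prompt), statistics.pstdev(hard_prompt))

# Global normalization: one mean and one std over the pooled rewards.
pooled = easy_prompt + hard_prompt
global_easy = zscores(easy_prompt, statistics.fmean(pooled), statistics.pstdev(pooled))

spread_local = max(local_easy) - min(local_easy)
spread_global = max(global_easy) - min(global_easy)
```

With a population std, a z-score within a group of $k$ values cannot exceed $\sqrt{k-1}$ in magnitude, so in this sketch the instability shows up as noise being promoted to the maximum possible advantage rather than as unbounded values.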

Proof: GRPO’s Local Advantage Estimator is Biased

Let me walk through the complete bias proof from Appendix A of the paper. This is the core theoretical contribution.

Setup:

Assume we observe $N$ rewards for a single prompt (so $N$ plays the role of the group size $k$). The true mean reward is $\theta$ and rewards are:

$$r_i = \theta + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2), \quad i = 1, \ldots, N \quad \text{(i.i.d.)} \tag{P1}$$

Define the sample mean, sample std, and advantage:

$$\bar{\epsilon} = \frac{1}{N}\sum_{j=1}^{N}\epsilon_j, \quad D = \sqrt{\frac{1}{N}\sum_{j=1}^{N}(\epsilon_j - \bar{\epsilon})^2}, \quad A_i = \frac{\epsilon_i - \bar{\epsilon}}{D} \tag{P2}$$

Theorem 1 (from Appendix A): For any finite $N \geq 2$:

$$E[A_i \mid \epsilon_i] \neq \epsilon_i \quad \Longrightarrow \quad A_i \text{ is a biased estimator.}$$

Proof, Step 1: Numerator Bias

Rewrite the numerator $(\epsilon_i - \bar{\epsilon})$:

$$\epsilon_i - \bar{\epsilon} = \epsilon_i - \frac{1}{N}\left(\epsilon_i + \sum_{j \neq i}\epsilon_j\right) = \left(1 - \frac{1}{N}\right)\epsilon_i - \frac{1}{N}\sum_{j \neq i}\epsilon_j$$

Taking the conditional expectation given $\epsilon_i$, using independence of $\epsilon_j$ ($j \neq i$) and $E[\epsilon_j] = 0$:

$$E[\epsilon_i - \bar{\epsilon} \mid \epsilon_i] = \left(1 - \frac{1}{N}\right)\epsilon_i \tag{P3}$$

So the numerator, in expectation, is a scaled version of $\epsilon_i$ with scale factor $(1 - 1/N)$. For $N = 4$ (typical GRPO group size), this scale is $3/4$.

Proof, Step 2: Denominator Depends on $\epsilon_i$

By the identity $D^2 = \frac{1}{N}\sum_j \epsilon_j^2 - \bar{\epsilon}^2$, we compute $E[D^2 \mid \epsilon_i]$.

Since $\bar{\epsilon} = \frac{1}{N}(\epsilon_i + \sum_{j \neq i}\epsilon_j)$:

$$E[\bar{\epsilon}^2 \mid \epsilon_i] = E\left[\left(\frac{\epsilon_i + \sum_{j \neq i}\epsilon_j}{N}\right)^2 \,\Big|\, \epsilon_i\right] = \frac{\epsilon_i^2}{N^2} + \frac{(N-1)\sigma^2}{N^2}$$

(cross-terms vanish by independence and zero mean). And:

$$E\left[\frac{1}{N}\sum_{j=1}^{N}\epsilon_j^2 \,\Big|\, \epsilon_i\right] = \frac{\epsilon_i^2 + (N-1)\sigma^2}{N}$$

Therefore, subtracting the first result from the second:

$$E[D^2 \mid \epsilon_i] = \frac{\epsilon_i^2 + (N-1)\sigma^2}{N} - \frac{\epsilon_i^2 + (N-1)\sigma^2}{N^2} = \frac{N-1}{N^2}\left(\epsilon_i^2 + (N-1)\sigma^2\right)$$

Simplifying:

$$\boxed{E[D^2 \mid \epsilon_i] = \frac{(N-1)^2 \sigma^2}{N^2} + \frac{N-1}{N^2}\epsilon_i^2} \tag{P4}$$

This is the key finding: $E[D^2 \mid \epsilon_i]$ explicitly contains the term $\frac{N-1}{N^2}\epsilon_i^2$. Therefore, the denominator $D$ depends on $\epsilon_i$; the two are not independent.
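Eq. P4 can be sanity-checked by simulation: fix $\epsilon_i$, draw the other $N-1$ noise terms, and average $D^2$. The sketch below is a quick Monte Carlo check under the Gaussian model of Eq. P1, not part of the paper's proof; the trial count and seed are arbitrary.

```python
import random

def mean_d_squared_given_eps_i(eps_i, n, sigma, trials, seed=0):
    """Monte Carlo estimate of E[D^2 | eps_i] under the model of Eq. P1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        group = [eps_i] + [rng.gauss(0.0, sigma) for _ in range(n - 1)]
        mean = sum(group) / n
        total += sum((e - mean) ** 2 for e in group) / n   # D^2 with the 1/N convention
    return total / trials

def p4_prediction(eps_i, n, sigma):
    """Right-hand side of Eq. P4."""
    return ((n - 1) ** 2 * sigma ** 2 + (n - 1) * eps_i ** 2) / n ** 2

n, sigma = 4, 1.0
for eps_i in (0.0, 2.0):
    simulated = mean_d_squared_given_eps_i(eps_i, n, sigma, trials=50_000)
    predicted = p4_prediction(eps_i, n, sigma)
    print(f"eps_i={eps_i}: simulated={simulated:.3f}, predicted={predicted:.4f}")
```

For $N = 4$ and $\sigma = 1$ the formula gives $0.5625$ at $\epsilon_i = 0$ and $1.3125$ at $\epsilon_i = 2$, so an outlier response more than doubles the expected squared denominator it is divided by.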

Proof, Step 3: Dependency implies bias

Since the numerator $(\epsilon_i - \bar{\epsilon})$ and the denominator $D$ are both functions of the same $N$-sample group and both depend on $\epsilon_i$, their correlation means:

$$E\left[\frac{\epsilon_i - \bar{\epsilon}}{D} \,\Big|\, \epsilon_i\right] \neq \frac{E[\epsilon_i - \bar{\epsilon} \mid \epsilon_i]}{E[D \mid \epsilon_i]}$$

(The ratio of expectations is not the expectation of the ratio when numerator and denominator are correlated; this is the classical ratio estimator bias.) $\square$

What the bias looks like qualitatively: For large $|\epsilon_i|$ (a very good or very bad response relative to the group), the denominator $D$ grows with $|\epsilon_i|$ (from Eq. P4), so the advantage $A_i = (\epsilon_i - \bar{\epsilon}) / D$ underestimates the true advantage for outlier responses. This systematically suppresses the gradient signal from the most informative samples.

Why global normalization fixes it: When normalizing over a batch of $B \gg N$ prompts, the batch mean $\mu_{\text{batch}}$ and std $\sigma_{\text{batch}}$ are statistics of independent prompt samples, not of the same $k$-sample group. As $B \to \infty$, $\mu_{\text{batch}} \to \text{const}$ and $\sigma_{\text{batch}} \to \text{const}$ (independent of any single reward $r_i$), and the bias vanishes:

$$E[A_i^{\text{global}}] = E\left[\frac{r_i - \mu_{\text{batch}}}{\sigma_{\text{batch}}}\right] \xrightarrow{B \to \infty} \frac{E[r_i] - \mu}{\sigma} \quad \text{(unbiased)}$$

REINFORCE++: Global Advantage Normalization

Figure 2: REINFORCE++ training loop (k=1 variant)

sequenceDiagram
    participant D as Dataset D
    participant Policy as Policy Model π_θ
    participant Ref as Reference Model π_ref
    participant Reward as Reward Model R
    participant Norm as Global Normalizer

    loop Each Training Step
        D->>Policy: Sample B prompts {q_1...q_B}
        Policy->>Policy: Copy π_old ← π_θ
        Policy->>Reward: Generate ONE response o_i per q_i (under π_old)
        Reward->>Norm: Compute r_i = R(o_i, q_i) for each i
        Policy->>Norm: Compute KL(t) = log[π_old(o_t|...) / π_ref(o_t|...)] per token
        Norm->>Norm: A_i = r_i - β · ΣKL(t)   [Eq.4 — raw advantage]
        Norm->>Norm: μ = mean({A_i}_all_tokens), σ = std({A_i}_all_tokens)
        Norm->>Policy: A_norm = (A_i - μ) / (σ + ε)   [Eq.5 — global norm]
        Policy->>Policy: Update θ via PPO clip objective with A_norm
    end

The REINFORCE++ raw advantage for each token $o_t$ in response $o$ to prompt $q$ is:

$$A_{q,o_t} = R(o_{1:T}, q) - \beta \cdot \sum_{i=t}^{T} \text{KL}^{(i)} \tag{4}$$

where $\text{KL}^{(i)} = \log\frac{\pi_{\theta_{\text{old}}}(o_i \mid q, o_{<i})}{\pi_{\text{ref}}(o_i \mid q, o_{<i})}$ is the per-token KL estimate between the current (old) policy and the reference model, and $\beta$ is the KL penalty coefficient.

Key design choice: KL is incorporated directly into the reward (k1-style formulation), not added as a separate loss term. This makes the KL penalty part of the advantage signal rather than an additional gradient term, which simplifies the objective to a pure PPO update.

Then apply global batch normalization:

$$A_{q,o_t}^{\text{norm}} = \frac{A_{q,o_t} - \text{mean}(A \mid A \in \mathcal{D}_{\text{batch}})}{\text{std}(A \mid A \in \mathcal{D}_{\text{batch}}) + \epsilon} \tag{5}$$

where $\mathcal{D}_{\text{batch}}$ contains all advantages for all tokens across all prompts in the current training batch. With $B = 1024$ prompts and average response length $|o| \approx 512$ tokens, we normalize over roughly $524{,}288$ advantage values, compared with the 4 or 8 values per prompt that GRPO uses.
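Eqs. 4 and 5 translate into a suffix sum followed by a single batch-wide z-score. The sketch below uses plain Python lists and invented log-probabilities; the function names are hypothetical and this is not the OpenRLHF implementation.

```python
import statistics

def raw_token_advantages(reward, logp_old, logp_ref, beta):
    """Eq. 4: A_t = R - beta * sum_{i >= t} KL_i, with KL_i = logp_old_i - logp_ref_i."""
    advantages = [0.0] * len(logp_old)
    kl_suffix = 0.0
    for t in reversed(range(len(logp_old))):
        kl_suffix += logp_old[t] - logp_ref[t]
        advantages[t] = reward - beta * kl_suffix
    return advantages

def global_normalize(per_response_advantages, eps=1e-8):
    """Eq. 5: one mean and one std over every token of every response."""
    flat = [a for response in per_response_advantages for a in response]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)
    return [[(a - mean) / (std + eps) for a in response]
            for response in per_response_advantages]

batch = [
    # (reward, log pi_old per token, log pi_ref per token): made-up numbers
    (1.0, [-1.0, -0.5, -2.0], [-1.2, -0.5, -1.5]),
    (0.2, [-0.3, -0.9], [-0.4, -1.0]),
]
raw = [raw_token_advantages(r, lp_old, lp_ref, beta=0.1) for r, lp_old, lp_ref in batch]
normalized = global_normalize(raw)
```

Because the statistics are computed over the flattened token list, longer responses contribute more values to the mean and std than shorter ones.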

Algorithm 1: REINFORCE++ (k=1 case) — Fully annotated pseudocode

REINFORCE++ (k=1) Algorithm
────────────────────────────────────────────────────────────────────
Input:
  - π_ref : frozen reference policy (SFT model)
  - R     : reward model (Bradley-Terry or rule-based)
  - D     : training prompt dataset
Hyperparameters:
  - B     : batch size (number of prompts per step, e.g., 1024)
  - β     : KL penalty coefficient (e.g., 0.01–0.1)
  - ε     : normalization epsilon (e.g., 1e-8 for numerical stability)
  - ε_clip: PPO clipping threshold (e.g., 0.2)
  - k_ppo : number of PPO update iterations per rollout (e.g., 1)
  - M     : total training steps

Initialize: π_θ ← π_ref    (start from reference, train to maximize reward)

For step = 1, 2, ..., M:
  ┌─────────────── Rollout Phase ───────────────
  │ 1. Sample batch: {q_1, ..., q_B} ← sample B prompts from D
  │    (Note: sample WITHOUT REPLACEMENT for prompt diversity)

  │ 2. Snapshot old policy: π_old ← π_θ
  │    (Critical: must freeze π_old before generating rollouts)

  │ 3. For each q_i ∈ batch:
  │      o_i ~ π_old(· | q_i)    ← ONE response per prompt
  │      (k=1 means no group sampling — max prompt diversity)

  │ 4. Compute raw advantages:
  │      For each (q_i, o_i):
  │        r_i = R(o_i, q_i)                      ← reward for full response
  │        For each token position t = 1...|o_i|:
  │          KL_t = log π_old(o_t|q_i,o<t) - log π_ref(o_t|q_i,o<t)
  │          A_raw(q_i, o_t) = r_i - β × Σ_{s=t}^{T} KL_s   [Eq.4]
  │          (token-level advantage: future KL subtracted from reward)
  └─────────────────────────────────────────────
  
  ┌─────────── Global Normalization Phase ──────────
  │ 5. Collect all raw advantages into D_batch:
  │      D_batch = {A_raw(q_i, o_t) : all i, all t}
  │      (size ≈ B × avg_length, e.g., 1024 × 512 = 524K values)

  │ 6. Compute global statistics:
  │      μ = mean(D_batch)                ← scalar, stable for large B
  │      σ = std(D_batch)                 ← scalar, stable for large B

  │ 7. Normalize each advantage:
  │      A_norm(q_i, o_t) = (A_raw(q_i, o_t) - μ) / (σ + ε)   [Eq.5]
  └─────────────────────────────────────────────

  ┌─────────── PPO Update Phase ──────────────────
  │ 8. For ppo_iter = 1, ..., k_ppo:
  │      For each (q_i, o_t):
  │        ratio_t = π_θ(o_t|q_i,o<t) / π_old(o_t|q_i,o<t)
  │        L_t = min(ratio_t × A_norm(q_i,o_t),
  │                  clip(ratio_t, 1-ε_clip, 1+ε_clip) × A_norm(q_i,o_t))
  │      L_total = mean(L_t over all tokens and prompts)
  │      θ ← θ + α × ∇_θ L_total          ← gradient ascent (maximize reward)
  └─────────────────────────────────────────────
────────────────────────────────────────────────────────────────────

Why k=1 is sufficient for general RLHF: In general-domain chat alignment, the reward model is a continuous Bradley-Terry model (not a sparse 0/1 signal). Even a single response per prompt provides useful gradient signal because there is variance in the reward across different prompts. Moreover, sampling only one response per prompt means the batch covers $B$ distinct prompts rather than $B/k$ prompts with $k$ responses each. This maximizes prompt-level diversity, which is crucial for generalization.

REINFORCE++w/Baseline: For Complex Reasoning Tasks

When tasks have sparse 0/1 rewards (math correctness, code pass rate), many responses get identical reward = 0 (failed) or reward = 1 (passed). In this case, k=1 sampling is suboptimal because many samples are uninformative and contribute no useful gradient. Sampling $k > 1$ responses per prompt and using group-mean subtraction to filter void samples is beneficial.

Figure 3: Two-step advantage computation in REINFORCE++w/Baseline

flowchart TD
    A["Sample k responses per prompt:\no^(1), o^(2), ..., o^(k) ~ π_old"] --> B
    B["Compute rewards:\nR^(1), R^(2), ..., R^(k)"] --> C
    C["Step 1: Group Mean Subtraction\n(reward reshaping / void filtering)\n\nA'_i = R^(i) - mean_group(R^(1)...R^(k))\n\nIf all R^(j) are equal:\n  A'_i = 0 for all i → no gradient!\n  (void sample filtered)"] --> D
    D["Step 2: Global Batch Normalization\n(stability + unbiasedness)\n\nA_norm_i = (A'_i - mean_batch) / (std_batch + ε)\n\nNow using GLOBAL stats, not group stats\n→ fixes the GRPO bias"] --> E
    E["PPO update with A_norm_i\n+\nSeparate k2 KL loss: J_k2 = E[½(log π_θ/π_ref)²]"]

Step 1: Group Mean Subtraction

$$A'_{q,o_t^{(i)}} = R(o^{(i)}) - \text{mean}_{\text{group}}\{R(o^{(j)})\}_{j=1}^{k} \tag{6}$$

This is simply subtracting the in-group mean. Its purpose is reward reshaping, not normalization:

  • Void sample filtering: If all $k$ group responses have reward = 0 (none solved the problem), then $\text{mean}_{\text{group}} = 0$ and $A'_i = 0$, so the policy receives no gradient from this prompt. That is correct: we have no information about which direction to move the policy for a problem none of the responses solved.
  • Reward-scheme robustness: Rewards in $[0, 1]$ and rewards in $[-1, +1]$ both produce zero-centered advantages after group-mean subtraction. The remaining factor-of-two scale difference is removed by the global normalization in Step 2, so the algorithm works with both reward schemes.
  • Reward shaping: Transforms sparse absolute rewards into relative rewards within the group, providing a denser gradient signal.

Step 2: Global Batch Normalization

$$A_{q,o_t}^{\text{norm}} = \frac{A'_{q,o_t} - \text{mean}_{\text{batch}}(A')}{\text{std}_{\text{batch}}(A') + \epsilon} \tag{7}$$

After group-mean subtraction, normalize using global batch statistics. This is the crucial fix over GRPO: we divide by $\text{std}_{\text{batch}}$ (computed over thousands of token-advantages from all prompts in the batch) rather than $\text{std}_{\text{group}}$ (computed over $k=4$ or $8$ values from a single prompt). The global std is orders of magnitude more stable.
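The two steps can be sketched together on a small batch of 0/1 rewards. For brevity this illustration works at the response level (one advantage per response) rather than per token, which is a simplifying assumption; the function name and rewards are hypothetical.

```python
import statistics

def reinforce_pp_baseline_advantages(group_rewards, eps=1e-8):
    """group_rewards: one list of k rewards per prompt."""
    # Step 1 (Eq. 6): subtract each prompt's own group mean.
    centered = [[r - statistics.fmean(group) for r in group] for group in group_rewards]
    # Step 2 (Eq. 7): normalize with the mean and std of the whole batch.
    flat = [a for group in centered for a in group]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)
    return [[(a - mean) / (std + eps) for a in group] for group in centered]

batch = [
    [0.0, 0.0, 0.0, 0.0],   # void group: nobody solved the problem
    [1.0, 0.0, 0.0, 1.0],   # mixed group
    [1.0, 1.0, 1.0, 0.0],   # mostly solved
]
advantages = reinforce_pp_baseline_advantages(batch)
```

The void group stays at exactly zero after both steps, because group-centered values already sum to zero across the batch, and every non-void group is divided by the same batch-level std rather than its own.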

The k2 KL Estimator:

REINFORCE++w/Baseline uses a separate KL penalty term in the loss rather than incorporating KL into the reward. It uses the $k_2$ estimator:

$$\mathcal{L} = \mathcal{L}^{\text{PPO}}(A^{\text{norm}}) - \lambda \cdot J_{k_2}(\theta), \quad J_{k_2}(\theta) = E\left[\frac{1}{2}\left(\log\frac{\pi_\theta}{\pi_{\text{ref}}}\right)^2\right] \tag{8}$$

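Why the $k_2$ estimator instead of GRPO’s $k_3$? With tokens sampled from $\pi_\theta$, the $k_3$ estimator used in GRPO is:

$$k_3 = \frac{\pi_{\text{ref}}}{\pi_\theta} - \log\frac{\pi_{\text{ref}}}{\pi_\theta} - 1$$

As a value estimate, $k_3$ is non-negative (because $\log x \leq x - 1$) and unbiased for the reverse KL divergence $\text{KL}(\pi_\theta \| \pi_{\text{ref}})$. The problem appears when it is differentiated as a loss term: its per-sample gradient is $(1 - \pi_{\text{ref}}/\pi_\theta)\,\nabla_\theta \log \pi_\theta$, which in expectation is the gradient of the forward KL $\text{KL}(\pi_{\text{ref}} \| \pi_\theta)$ rather than of the reverse KL, and the ratio $\pi_{\text{ref}}/\pi_\theta$ is unbounded when a sampled token has low probability under $\pi_\theta$. The $k_2$ estimator $\frac{1}{2}(\log \text{ratio})^2$ is:

  • Always non-negative ✓
  • Smooth, with a gradient that grows only linearly in the log-ratio ✓
  • An unbiased gradient estimate for the reverse KL, since $E_{\pi_\theta}\left[\log\frac{\pi_\theta}{\pi_{\text{ref}}} \nabla_\theta \log \pi_\theta\right] = \nabla_\theta \text{KL}(\pi_\theta \| \pi_{\text{ref}})$ ✓
  • Behaves like a squared-error loss on the log-ratio, penalizing deviations proportionally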
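The per-token estimators can be compared exactly on a small categorical distribution. The sketch below follows the common $k_1$/$k_2$/$k_3$ definitions for tokens sampled from $\pi_\theta$, with $k_3$ written in terms of the ratio $\pi_{\text{ref}}/\pi_\theta$; the two distributions are invented and the expectations are computed by summing over the vocabulary, not by sampling.

```python
import math

def k1(p_theta, p_ref):
    """Plain log-ratio; unbiased for KL(pi_theta || pi_ref) but can be negative."""
    return math.log(p_theta / p_ref)

def k2(p_theta, p_ref):
    """Half the squared log-ratio; always non-negative."""
    return 0.5 * math.log(p_theta / p_ref) ** 2

def k3(p_theta, p_ref):
    """r - log(r) - 1 with r = pi_ref / pi_theta; always non-negative."""
    r = p_ref / p_theta
    return r - math.log(r) - 1.0

pi_theta = [0.7, 0.2, 0.1]   # hypothetical next-token distributions
pi_ref = [0.5, 0.3, 0.2]

def expectation(estimator):
    """Exact expectation under pi_theta, summing over all tokens."""
    return sum(p * estimator(p, q) for p, q in zip(pi_theta, pi_ref))

true_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))
e_k1, e_k2, e_k3 = expectation(k1), expectation(k2), expectation(k3)
per_token_k1 = [k1(p, q) for p, q in zip(pi_theta, pi_ref)]
per_token_k3 = [k3(p, q) for p, q in zip(pi_theta, pi_ref)]
```

In this example $k_1$ and $k_3$ both average to the true KL, individual $k_1$ values go negative while $k_3$ values do not, and $k_2$ is slightly off as a value estimate. The case for $k_2$ in the list above rests on its gradient, not on its value.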

Relationship to PPO: A Simplified View

REINFORCE++w/Baseline can be viewed as PPO with the critic removed and GAE replaced by two-step global normalization:

Figure 4: Conceptual equivalence — PPO simplified to REINFORCE++w/Baseline

graph TD
    A["Full PPO"] --> A1["Actor: π_θ"]
    A --> A2["Critic: V_φ\n(same parameter count as actor)\nComputes: δ_t = r_t + γV(s_{t+1}) - V(s_t)\nGAE: A_t = Σ (γλ)^l δ_{t+l}"]
    A --> A3["PPO clip objective"]

    B["Simplification steps:"] --> B1["Remove critic: V_φ ← 0\nNow: δ_t = r_t (no bootstrapping)"]
    B1 --> B2["Set γ=1: no future discounting\n(reward only at end of response)"]
    B2 --> B3["Set λ=1: full MC return\nA_t = Σ_{l=0}^∞ δ_{t+l} = R (total reward)"]
    B3 --> B4["Replace fixed baseline with\nglobal batch normalization\nfor stability"]

    B4 --> C["REINFORCE++w/Baseline\n= PPO with V=0, γ=λ=1,\nglobal norm instead of learned value"]

This equivalence is illuminating: the critic network in PPO serves to estimate $V(s_t)$, the expected future return from state $s_t$. When the reward is only given at the end of the response (as in RLHF), the “state” at token $t$ is the partial response $(q, o_{<t})$, and the value $V(s_t) = E[R]$ is approximately constant (equal to the expected final reward). So the critic’s contribution is essentially just estimating a mean baseline, which is exactly what global batch normalization achieves, more accurately, with far less compute.

Summary of Method Comparison

Figure 5: Side-by-side comparison of all critic-free methods

graph LR
    subgraph Methods["Critic-Free RLHF Methods"]
        direction TB
        R1["REINFORCE\nA_t = R - b (constant baseline)\nBias: No, Variance: High\nk: 1"]
        R2["ReMax\nA_t = R(o) - R(o_greedy)\nBias: No, Variance: Medium\nk: 1+1 (greedy extra pass)"]
        R3["RLOO\nA_t = R(o_i) - mean_others\nBias: No (leave-one-out)\nVariance: Medium, k≥2"]
        R4["GRPO\nA_t = (R - mean_grp)/std_grp\nBias: YES (proven)\nInstability: std→0 risk\nk≥2, prompt overfitting"]
        R5["REINFORCE++\nA_t = (R - KL) global norm\nBias: Effectively no\nStability: High\nk=1 or k>1"]
        R6["REINFORCE++w/Baseline\nA_t = (R - mean_grp)\n     / std_batch (global)\nBias: Effectively no, k2 KL loss, k>1\nVoid filter: Yes"]
    end

Experiments

Experiment 1: General RLHF — Chat-Arena-Hard

Setup: Llama-3-8B-SFT trained on 20,000 diverse prompts using a Bradley-Terry reward model trained on ~700K human preference pairs. Policy trained for multiple steps using OpenRLHF. Evaluation: Chat-Arena-Hard (a hard normalized leaderboard based on LLM judge comparisons).

Table 1: General RLHF comparison (Chat-Arena-Hard)

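| Algorithm | Norm Type | Samples/Prompt | Score | Length (tokens) | Per-Token Score |
|---|---|---|---|---|---|
| REINFORCE++ (k=1) | Global | 1 | 46.7 | 832 | 0.0561 |
| GRPO (k=4) | Local | 4 | 46.8 | 860 | 0.0544 |
| RLOO (k=4) | Leave-one-out | 4 | 44.6 | 866 | 0.0515 |
| ReMax (k=1+1) | Fixed baseline | 1+1 | 45.1 | 805 | 0.0560 |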

Analysis:

  1. Score vs. Efficiency: REINFORCE++ (k=1) achieves nearly identical score to GRPO (k=4) while using 4× fewer responses per prompt. This translates directly to 4× lower reward model inference cost and lower memory pressure.

  2. Length and reward hacking: GRPO produces slightly longer responses on average (860 vs. 832 tokens, about 3% more). Length growth is a classic sign of reward hacking, where the model learns that longer responses get higher reward from some reward models. The gap here is small, so it is suggestive rather than conclusive, but the length-adjusted per-token score also favors REINFORCE++ (0.0561 vs. 0.0544).

  3. KL dynamics: Training curves (Figure 2 in the paper) show REINFORCE++ maintains much lower KL divergence throughout training while achieving comparable reward. Lower KL means the trained policy stays closer to the reference model distribution, preserving model quality while gaining task-specific performance.

  4. RLOO degradation: Somewhat surprisingly, RLOO (k=4, leave-one-out baseline) scores lower (44.6) despite using 4× samples. This suggests the leave-one-out estimator has higher variance than either GRPO or REINFORCE++ for this regime.

Experiment 2: Catastrophic Overfitting on Small Datasets

This experiment is the most striking demonstration of GRPO’s local normalization flaw.

Setup: Train on only 30 questions from AIME-24 (a competition math dataset). Evaluate on AIME-25 (unseen questions of similar difficulty). The model used is not specified but is a base model trained from zero (RL from scratch without SFT initialization).

Table 2: Overfitting experiment — train 30 AIME-24, evaluate AIME-25

| Algorithm | Train (AIME-24) Pass@1 | Test (AIME-25) Pass@1 | Test (AIME-25) Pass@16 |
|---|---|---|---|
| GRPO (local norm) | 95.0% | 0.0% | 0.4% |
| REINFORCE++ (global norm) | 71.0% | 2.5% | 40.0% |

What is happening with GRPO: GRPO’s local normalization optimizes the policy to be “better than other responses to the same training prompts.” With only 30 training prompts, the model quickly learns to game those specific 30 problems — achieving 95% pass@1 on the training set. But the underlying reasoning skill does not transfer: the model achieves 0% pass@1 on unseen AIME-25 questions.

Why REINFORCE++ generalizes better: Global normalization compares each response’s reward against the global batch mean across all training prompts. The policy must improve on a diverse set of prompts simultaneously — it cannot specialize to individual training prompts. REINFORCE++ trains more slowly (71% on training set) but achieves meaningful test performance (2.5% pass@1, 40% pass@16).

The pass@16 gap (0.4% vs 40%) is especially revealing: GRPO has essentially collapsed the model’s diversity — when given 16 chances to solve an AIME-25 problem, it still almost never succeeds because all 16 responses are very similar (the model has converged to a narrow, training-prompt-specific strategy). REINFORCE++ preserves solution diversity, so pass@16 is far higher than pass@1.

Experiment 3: OOD Generalization — K&K Logic Puzzles

Knights and Knaves (K&K) puzzles require deductive reasoning about logical constraints. Difficulty increases naturally with the number of “people” in the puzzle (more people = more constraints = harder). This provides a natural OOD test: train on 2-5 person puzzles, test generalization on 6-8 person puzzles.

Results: GRPO is competitive on easy tasks (2-3 people) but performance collapses on harder OOD tasks (6-8 people). REINFORCE++ outperforms GRPO on all tasks with 4+ people and achieves a much higher average accuracy (62.1 vs. 55.7). The global normalization forces the model to build reasoning heuristics that transfer across difficulty levels, rather than memorizing patterns from specific training puzzle structures.

This experiment connects to a broader point: local normalization creates an implicit curriculum where “winning within the local group” is the objective, but this does not generalize to harder out-of-distribution tasks.

Experiment 4: Complex Tool-Use Agent — ZeroTIR

The hardest test: training a Qwen-2.5-Base-7B model with RL directly from the base checkpoint (RL from Zero, with no SFT initialization) to use Python tools for mathematical problem solving in a multi-turn environment. This requires:

  • Learning to invoke Python tool calls syntactically
  • Learning to use the tool results to guide reasoning
  • Long-horizon credit assignment (tool calls early in the response affect the final answer)
  • Handling void samples (many early-training responses fail entirely)

Training uses datasets from ORZ and DAPO. Evaluation covers AIME 2024, AIME 2025, HMMT Feb 2024/2025, and CMIMC, using the average@32 metric (accuracy averaged over 32 independent samples per problem).

Table 4: Complex tool-use benchmark (average@32)

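| Algorithm | AIME’24 | AIME’25 | HMMT’25 | HMMT’24 | CMIMC | Avg |
|---|---|---|---|---|---|---|
| GRPO (local norm, k>1) | 31.66 | 21.87 | 16.97 | 17.70 | 24.68 | 22.58 |
| PPO (full critic) | 30.20 | 21.66 | 15.00 | 18.43 | 23.95 | 21.85 |
| REINFORCE++w/Baseline | 30.83 | 27.18 | 17.91 | 18.95 | 25.62 | 24.10 |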

Key results:

  1. REINFORCE++w/Baseline beats PPO (+2.25 avg score) despite having no critic network. This is the headline result: a simpler algorithm outperforms the heavyweight.

  2. The biggest gain is on AIME-2025 (hardest, most OOD): +5.31 over GRPO, +5.52 over PPO. This is consistent with global normalization helping most on hard, out-of-distribution problems.

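  3. GRPO edges out PPO on average (22.58 vs. 21.85) but trails REINFORCE++w/Baseline by 1.52 points, with most of that gap coming from AIME’25. This is consistent with local normalization’s overfitting being most costly on the hardest problems in this complex multi-turn agentic setting.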

  4. The void-filtering property is crucial here: In early RL-from-Zero training, many responses fail entirely (reward = 0). When every response in a group receives the same reward, group-mean subtraction sets all of that group’s advantages to zero, so void groups contribute no gradient and training stays stable while most responses are uninformative.

Best Practices: When to Use Which Variant

The paper provides clear guidance based on two key dimensions:

Use REINFORCE++ (k=1) when:

  • General-domain chat alignment with continuous Bradley-Terry reward
  • Prompt diversity is paramount (large diverse prompt sets)
  • Process-supervised reward models (PRMs) where getting k responses per prompt is expensive
  • Online/realtime RL where each step must be fast
  • Symmetric reward signals (e.g., -1 to +1 range)

Use REINFORCE++w/Baseline (k>1) when:

  • Sparse binary rewards (RLVR: math correctness, code pass/fail, rule compliance)
  • Complex multi-step reasoning or agentic tool-use tasks
  • High void sample rate (many responses get reward = 0)
  • Both 0/1 and -1/1 reward formats (group-mean subtraction handles both)
  • Low-data regimes where prompt-level overfitting is a risk

Practical hyperparameter guidance (from third-party validation):

  • Batch size $B \geq 512$, preferably $B = 1024$ for stable global statistics
  • For w/Baseline variant: group size $k = 4$ to $8$ is typical
  • KL coefficient $\beta = 0.01$ to $0.1$ (smaller for RLVR tasks, larger for chat alignment)
  • The $k_2$ KL coefficient $\lambda = 0.001$ to $0.01$ (much smaller than $\beta$ since it’s a separate loss term)
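The ranges above can be captured in a small config object. This is a sketch only: the class name, field names, and warning messages are hypothetical and do not correspond to any framework's actual flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReinforcePPConfig:
    """Hypothetical container for the hyperparameter ranges listed above."""
    batch_size: int = 1024        # B >= 512 recommended, 1024 preferred
    group_size: int = 1           # k = 1 for REINFORCE++, 4 to 8 for w/Baseline
    kl_beta: float = 0.01         # beta in [0.01, 0.1]
    k2_kl_lambda: float = 0.001   # lambda in [0.001, 0.01]

    def warnings(self) -> list[str]:
        """Return human-readable notes for values outside the quoted ranges."""
        notes = []
        if self.batch_size < 512:
            notes.append("batch_size < 512: global statistics may be noisy")
        if self.group_size > 1 and not 4 <= self.group_size <= 8:
            notes.append("group_size outside the typical 4 to 8 range")
        if not 0.01 <= self.kl_beta <= 0.1:
            notes.append("kl_beta outside [0.01, 0.1]")
        if not 0.001 <= self.k2_kl_lambda <= 0.01:
            notes.append("k2_kl_lambda outside [0.001, 0.01]")
        return notes


rlvr_cfg = ReinforcePPConfig(group_size=8)     # w/Baseline-style setup
small_cfg = ReinforcePPConfig(batch_size=64)   # triggers the batch-size note
```

Keeping the checks as soft warnings rather than hard errors reflects that these are rules of thumb from third-party validation, not hard constraints.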

Reproducibility Notes

The algorithm is implemented in OpenRLHF (arXiv 2405.11143), which is open-source and widely used. Both variants are available:

  • REINFORCE++ in OpenRLHF as reinforce_plus_plus trainer
  • REINFORCE++w/Baseline available in the same framework

Key configuration settings to reproduce results:

  • Use global normalization flag (off by default in most frameworks that default to GRPO-style local norm)
  • k2 KL estimator (not k3) for the w/Baseline variant
  • Large batch size for global statistics stability
  • No critic model initialization needed

The paper trains on a setup compatible with 8×A100 80GB or equivalent, using the ZeroTIR environment from Mai et al. (2025) for the agentic experiments. The reasoning experiments use standard AIME/AMC competition math datasets that are publicly available.

Limitations and Boundary Conditions

  1. Batch size dependency: Global normalization requires large batches ($B \geq 512$) for stable mean/std estimates. At 70B+ parameter scale where per-GPU batch size is very small and gradient accumulation is needed, global stats may span multiple gradient accumulation steps, complicating implementation.

  2. Homogeneous reward scale assumption: Global normalization assumes all rewards in the batch are comparable in scale. For multi-task training with heterogeneous reward types (chat + code + safety rewards with different scales), pre-normalizing rewards per task type before applying global normalization may be necessary.

  3. k=1 variant is not optimal for RLVR: For sparse 0/1 rewards, k=1 may generate many zero-reward training samples. The paper acknowledges this and recommends REINFORCE++w/Baseline for RLVR tasks, but does not provide guidance on choosing $k$ given the expected pass rate.

  4. No comparison with Online DPO / SimPO / DAPO: The paper focuses on the REINFORCE/PPO family. DAPO (Yu et al., 2025) also targets GRPO’s training instability, but through different mechanisms (Clip-Higher, dynamic sampling, a token-level policy gradient loss, and overlong reward shaping) rather than by changing the normalization scope. A direct comparison is absent.

  5. OOD experiments use only math/logic benchmarks: All OOD generalization tests are on mathematical reasoning tasks. Whether global normalization similarly improves OOD generalization on coding, safety alignment, or general instruction following is not demonstrated.
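Limitation 2 suggests pre-normalizing rewards per task type before global normalization. A minimal, dependency-free sketch of that idea follows. The function name, the task labels, and the choice of per-task z-scoring are assumptions for illustration, not something the paper specifies.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def standardize_per_task(rewards, tasks, eps=1e-8):
    """Z-score each reward within its own task type (hypothetical helper).

    rewards: list of floats, one per response in the batch
    tasks:   list of task labels, same length as rewards
    """
    by_task = defaultdict(list)
    for r, t in zip(rewards, tasks):
        by_task[t].append(r)
    stats = {t: (fmean(v), pstdev(v)) for t, v in by_task.items()}
    return [(r - stats[t][0]) / (stats[t][1] + eps) for r, t in zip(rewards, tasks)]


# Made-up batch: chat rewards live on roughly [-5, 5], code rewards on {0, 1}.
rewards = [4.0, -2.0, 1.0, -3.0, 1.0, 0.0, 0.0, 1.0]
tasks = ["chat"] * 4 + ["code"] * 4
scaled = standardize_per_task(rewards, tasks)
# After this step both task types have mean ~0 and std ~1, so neither
# dominates the global mean/std computed over the whole batch.
```

Without such a step, the wide-range chat rewards would set the global standard deviation and shrink the code-task advantages toward zero.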

Critical Assessment: Weaknesses & Improvements

Weaknesses and Flaws

(a) The overfitting experiment is an extreme, artificially small regime: Table 2 trains on only 30 AIME-24 questions — an extreme data scarcity scenario far from typical RLHF deployments (which use tens of thousands to millions of prompts). GRPO’s 95%→0% collapse is striking, but does not answer the more practically important question: at what training set size does GRPO’s local normalization cause meaningful (but not catastrophic) overfitting compared to global normalization? A data-size scaling experiment would be far more informative.

(b) No batch-size ablation for global normalization: The “effectively unbiased” property of global normalization is asymptotic (bias vanishes as $B \to \infty$). But what is the minimum $B$ needed for global normalization to outperform local normalization in practice? For large model training where $B < 64$ may be necessary due to GPU memory constraints, this question is critical and completely unaddressed. The paper never shows a batch-size vs. performance curve.

(c) Decomposition ablation is missing for w/Baseline variant: REINFORCE++w/Baseline introduces two changes over GRPO simultaneously: (1) global std instead of local std, and (2) k2 KL estimator instead of k3. There is no 2×2 ablation table isolating these contributions. The k2 vs. k3 change may account for a significant fraction of the improvement, especially since a separate paper (Liu et al., 2025a) specifically showed that k3 is an unstable approximation.

(d) Comparison with DAPO is missing: DAPO (released around the same time) also modifies GRPO to improve training stability, using decoupled clipping (Clip-Higher), dynamic sampling, and a token-level policy gradient loss. Since both papers claim to fix GRPO’s instability for reasoning tasks and both use similar experimental setups (math benchmarks, Qwen base models), a direct comparison would be the natural experiment, but it is absent from both papers.

(e) The k3 estimator critique needs more rigor: The paper cites Liu et al. (2025a) for the claim that k3 produces unstable gradients, but does not provide its own experimental demonstration. A simple training curve comparison (k2 vs. k3 KL estimator, all else equal) would strengthen the argument considerably.

Limitations the Authors Understate or Omit

(f) The “global vs. local” framing obscures a nuanced tradeoff: Global normalization compares rewards from completely different prompts with different inherent difficulties. This introduces a different kind of noise: if easy prompts have reward 0.9 and hard prompts have reward 0.1, the global std will be large and advantages will be diluted. Local normalization’s bias is well-characterized (as the paper proves), but global normalization’s sensitivity to prompt difficulty distribution is not analyzed. This matters for curriculum learning settings where easy and hard prompts are mixed.

(g) The paper does not quantify the practical bias of GRPO: It proves the bias exists theoretically, but never measures its magnitude empirically (e.g., by comparing advantages from local vs. global normalization against the true Monte Carlo advantage). This makes it hard to judge how significant the bias is in practice for typical GRPO hyperparameters (k=4, k=8).

(h) k=1 efficiency claim ignores batch composition costs: The paper claims k=1 is “4× more efficient” than k=4 GRPO because it needs 4× fewer reward model calls per prompt. However, REINFORCE++ with k=1 needs 4× more diverse prompts per batch to maintain comparable signal quality (you need $B = 1024$ distinct prompts rather than $B/4 = 256$ prompts × 4 responses). Loading and sampling 4× more diverse prompts may not always be feasible or efficient, especially for specialized datasets (e.g., 30 AIME problems).
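Point (f) can be illustrated with a toy batch. The sketch below uses made-up rewards and a hypothetical helper name, and assumes k = 1 with plain global z-scoring. It shows how prompt difficulty, rather than response quality, can determine the sign of the advantage.

```python
from statistics import fmean, pstdev


def global_zscore(values, eps=1e-8):
    """Normalize a flat list of rewards by the batch mean and std."""
    mu, sd = fmean(values), pstdev(values)
    return [(v - mu) / (sd + eps) for v in values]


# Made-up k = 1 batch: one response per prompt.
# Easy prompts have an expected reward near 0.9, hard prompts near 0.1.
expected = [0.9, 0.9, 0.1, 0.1]
observed = [0.85, 0.95, 0.15, 0.05]

adv = global_zscore(observed)

# Response 0 is below par for its (easy) prompt yet gets a positive advantage;
# response 2 is above par for its (hard) prompt yet gets a negative one.
below_par_easy = adv[0]
above_par_hard = adv[2]

# Centering on each prompt's expected reward first (what group-mean
# subtraction approximates when k > 1) restores the intended signs.
centered = global_zscore([o - e for o, e in zip(observed, expected)])
```

In this toy case the between-prompt spread (0.9 vs. 0.1) dominates the global standard deviation, so the within-prompt differences that actually reflect response quality are both mis-signed and compressed.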

Concrete Improvement Suggestions

  1. Batch-size ablation experiment: Train REINFORCE++ with $B \in \{64, 128, 256, 512, 1024\}$ and measure final performance on a fixed benchmark. This would provide the most important practical guidance missing from the paper.

  2. Full decomposition ablation for w/Baseline: Present a 2×2 table: (local/global norm) × (k2/k3 KL) — 4 configurations on the ZeroTIR benchmark. This would definitively establish which change matters more.

  3. Data-size scaling experiment: Train GRPO vs. REINFORCE++ on prompt sets of size 30, 100, 500, 2000, 10000, and plot test accuracy vs. training set size. This would show at what scale local normalization’s overfitting becomes negligible, giving practitioners guidance on when GRPO is “safe enough” to use.

  4. Prompt difficulty distribution analysis: Analyze the effect of mixing easy and hard prompts (different reward distributions) on global normalization stability. Propose a per-difficulty reward normalization scheme if needed.

  5. Comparison with DAPO at equivalent compute: Run DAPO, REINFORCE++, and GRPO on identical hardware/compute budgets on the ZeroTIR benchmark and report both performance and GPU-hours. This would establish the Pareto frontier of performance vs. compute.

Conclusion

REINFORCE++ makes a clean theoretical argument and backs it with solid empirical evidence. The insight — that per-prompt normalization in GRPO creates a biased, unstable advantage estimator, and that global batch normalization fixes it — is simple to understand, easy to implement, and practically impactful.

The two-variant design is thoughtful: REINFORCE++ (k=1) for general alignment training and REINFORCE++w/Baseline (k>1) for complex reasoning tasks. The paper cleanly explains the intuition for why group sampling helps for sparse rewards (void filtering via group-mean subtraction) while global normalization handles stability across both variants.

The algorithm has been independently validated and adopted in multiple large-scale systems: OpenRLHF, TRL (HuggingFace’s training library), Seed1.5-Thinking, and ScaleRL’s 16,000 GPU-hour experiments all confirm that global batch normalization is “slightly superior in both compute efficiency and final performance” over GRPO’s local normalization. This widespread adoption is the strongest real-world endorsement of the paper’s core contribution.

The main weaknesses are the missing batch-size ablation, the lack of an ablation isolating the KL estimator’s contribution, and the artificially small overfitting experiment. These are limitations of scope rather than correctness: the core theoretical result is sound and the method works in practice. For anyone building RLHF systems today, REINFORCE++ is a simple, theoretically justified replacement for GRPO that adds stability and generalization at no additional cost.

References

  • REINFORCE++ (2025): Jian Hu et al. arXiv:2501.03262v9
  • PPO (2017): Schulman et al. arXiv:1707.06347
  • GAE (2016): Schulman et al. arXiv:1506.02438 (ICLR 2016)
  • GRPO / DeepSeekMath (2024): Shao et al. arXiv:2402.03300
  • RLOO (2024): Ahmadian et al.
  • ReMax (2023): Li et al. arXiv:2310.10505
  • InstructGPT / RLHF (2022): Ouyang et al. arXiv:2203.02155
  • DPO (2023): Rafailov et al., NeurIPS 2023
  • DAPO (2025): Yu et al. arXiv:2503.14476
  • OpenRLHF (2024): Hu et al. arXiv:2405.11143
  • ScaleRL (2025): Khatri et al. arXiv:2510.13786
  • DLER (2025): Liu et al. arXiv:2510.15110
  • LitePPO (2025): Liu et al. arXiv:2508.08221
  • VAPO (2025): Yue et al. arXiv:2504.05118
  • DeepSeek-R1 (2025): Guo et al. arXiv:2501.12948
  • KL Regularization Analysis (2025): Liu et al. arXiv:2510.01555
  • Agent RL Scaling Law (2025): Mai et al. arXiv:2505.07773

Appendix A: Deriving the Full Bias Expression for GRPO

This section expands the bias proof to give a quantitative sense of how large the bias is for typical GRPO hyperparameters ($k = 4$ or $k = 8$).

Setting Up the Bias Calculation

Under the Gaussian model from the proof (Eq. P1), with group size $N$ and true advantage $\epsilon_i$:

Numerator expectation (from Eq. P3):

$$E[\epsilon_i - \bar{\epsilon} \mid \epsilon_i] = \left(1 - \frac{1}{N}\right) \epsilon_i$$

Denominator expectation:

From Eq. P4:

$$E[D^2 \mid \epsilon_i] = \frac{(N-1)^2 \sigma^2}{N^2} + \frac{(N-1)}{N^2}\epsilon_i^2$$

For the denominator $D$ itself (using $E[D] \approx \sqrt{E[D^2]}$ for small relative variance of $D^2$):

$$E[D \mid \epsilon_i] \approx \sqrt{\frac{(N-1)}{N^2}\left((N-1)\sigma^2 + \epsilon_i^2\right)}$$

Approximate bias of $A_i$ (using the first-order approximation $E[X/Y] \approx E[X]/E[Y]$):

$$E[A_i \mid \epsilon_i] \approx \frac{(1 - 1/N)\epsilon_i}{\sqrt{\frac{(N-1)}{N^2}\left((N-1)\sigma^2 + \epsilon_i^2\right)}}$$

For a “correct” (unbiased) normalized advantage, we’d want $E[A_i \mid \epsilon_i] = \epsilon_i / \sigma$ (z-score). The bias ratio is:

$$\frac{E[A_i \mid \epsilon_i]}{\epsilon_i / \sigma} = \frac{(1-1/N)\sigma}{\sqrt{\frac{(N-1)}{N^2}\left((N-1)\sigma^2 + \epsilon_i^2\right)}} = \frac{(N-1)\sigma}{\sqrt{(N-1)^2 \sigma^2 + (N-1)\epsilon_i^2}} = \frac{1}{\sqrt{1 + \epsilon_i^2/((N-1)\sigma^2)}}$$

Numerical examples:

| Group size $N$ | $\epsilon_i / \sigma$ (standardized advantage) | Bias ratio |
|---|---|---|
| 4 | 0 (average response) | 1.00 (no bias for zero advantage) |
| 4 | 1 (1 std above mean) | $1/\sqrt{1 + 1/3} \approx 0.866$ (13% underestimate) |
| 4 | 2 (2 std above mean) | $1/\sqrt{1 + 4/3} \approx 0.655$ (35% underestimate) |
| 8 | 1 | $1/\sqrt{1 + 1/7} \approx 0.935$ (6.5% underestimate) |
| 8 | 2 | $1/\sqrt{1 + 4/7} \approx 0.798$ (20% underestimate) |
| 64 (global batch) | 1 | $1/\sqrt{1 + 1/63} \approx 0.992$ (0.8% underestimate) |

Interpretation: For typical GRPO with $k = 4$, the advantage of a response 2 standard deviations above the group mean is underestimated by 35%. This means the best responses in the group receive 35% less gradient reinforcement than they should, a substantial suppression of the gradient signal from the most informative training examples. At $k = 8$, the bias is 20% for the same response. For global batch normalization with $B = 64$ prompts, the bias for a response 1 standard deviation above the mean drops to less than 1% (and to about 3% at 2 standard deviations).

This quantitative analysis shows the bias is not a theoretical curiosity — it has meaningful practical consequences for the gradient signal quality.
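The table can be reproduced, and the approximation sanity-checked, with a short script. This is a sketch under the same Gaussian model: bias_ratio implements the closed form above, while simulated_ratio is a hypothetical Monte Carlo check that fixes $\epsilon_i$ and samples the other $N-1$ group members. Because the closed form replaces the expectation of a ratio with a ratio of expectations, the simulated value need not match it exactly. One exact fact serves as a guard rail: with the population standard deviation, a z-score inside a group of $N$ can never exceed $\sqrt{N-1}$ in magnitude (Samuelson's inequality), so at $N = 4$ a response that is truly $2\sigma$ above the mean can receive at most about 1.73.

```python
import random
from math import sqrt


def bias_ratio(n: int, z: float) -> float:
    """Closed-form approximation of E[A_i | eps_i] / (eps_i / sigma), z = eps_i / sigma."""
    return 1.0 / sqrt(1.0 + z * z / (n - 1))


def simulated_ratio(n: int, z: float, trials: int = 5000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same ratio with sigma = 1 (hypothetical check)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        group = [z] + [rng.gauss(0.0, 1.0) for _ in range(n - 1)]
        mean = sum(group) / n
        std = sqrt(sum((x - mean) ** 2 for x in group) / n)  # population std
        total += (z - mean) / std  # local (per-group) normalization
    return (total / trials) / z


# Rows of the table above, rounded to three decimals
table = {(n, z): round(bias_ratio(n, z), 3)
         for n, z in [(4, 1), (4, 2), (8, 1), (8, 2), (64, 1)]}

# Simulated counterpart of the N = 4, 2-sigma row
mc_4_2 = simulated_ratio(4, 2.0)
```

Comparing mc_4_2 with table[(4, 2)] gives a rough sense of how much the first-order approximation matters at small group sizes.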

Appendix B: Practical Implementation Notes

Implementing Global Batch Normalization in PyTorch

The core computation is simple. The key engineering consideration is that advantages must be computed across all prompts in the batch before the PPO update loop, not per-prompt.

# Sketch of REINFORCE++ global normalization (PyTorch)
# Assumes: advantages is a flat tensor of all token-level advantages
#          across all prompts in the batch
import torch


def global_normalize(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """
    Global advantage normalization (REINFORCE++ Eq.5).

    Arguments:
        advantages: (total_tokens,) flat tensor of raw token-level advantages
                    from ALL prompts in the batch
        eps: numerical stability constant
    Returns:
        Normalized advantages with zero mean and (approximately) unit variance
    """
    mean = advantages.mean()
    std = advantages.std(unbiased=False)  # population std over the whole batch
    return (advantages - mean) / (std + eps)


# In the training loop. reward_model, compute_per_token_kl, policy, ref_policy,
# batch, and beta are placeholders supplied by the surrounding trainer.
all_advantages = []
for prompt, response in batch:
    r = float(reward_model(prompt, response))  # scalar sequence-level reward
    kl_penalties = compute_per_token_kl(policy, ref_policy, prompt, response)  # shape: (T,)
    # Token t gets the final reward minus beta times the KL summed from t to T
    kl_to_go = kl_penalties.flip(0).cumsum(0).flip(0)
    raw_adv = r - beta * kl_to_go
    all_advantages.append(raw_adv)

# Concatenate and normalize globally. DO NOT normalize per-prompt!
all_advantages = torch.cat(all_advantages)   # shape: (total_tokens,)
normalized_advantages = global_normalize(all_advantages)

# Assign back to each prompt's tokens and run PPO update
# ...

Critical implementation detail: The kl_penalties.flip(0).cumsum(0).flip(0) trick computes $\sum_{i=t}^{T} \text{KL}_i$ (the suffix sum from $t$ to $T$) by reversing, taking a cumulative sum, and reversing again. This is more efficient than a loop.
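For readers who want to check the reverse-cumsum-reverse identity without PyTorch, here is a dependency-free sketch using itertools.accumulate. The KL values and the reward are made up, and suffix_sums is a hypothetical helper name.

```python
from itertools import accumulate


def suffix_sums(values):
    """Return s where s[t] = values[t] + values[t+1] + ... + values[-1]."""
    # Reverse, take running sums, reverse again: the list analogue of
    # tensor.flip(0).cumsum(0).flip(0).
    return list(accumulate(reversed(values)))[::-1]


kl = [1, 2, 3, 4]             # made-up per-token KL penalties
kl_to_go = suffix_sums(kl)    # [10, 9, 7, 4]

# Token-level raw advantage: final reward minus beta times the KL still to come
reward, beta = 5, 2           # made-up values; integers keep the arithmetic exact
raw_adv = [reward - beta * s for s in kl_to_go]   # [-15, -13, -9, -3]
```

Early tokens carry the penalty for all the KL accumulated after them, which is why the first entry is the most negative.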

Implementing Group-Mean Subtraction (w/Baseline variant)

def group_mean_subtract_then_global_normalize(
    group_rewards: list[list[float]],  # shape: [n_prompts, k]
    eps: float = 1e-8
) -> list[list[float]]:
    """
    REINFORCE++w/Baseline two-step normalization.
    
    Step 1: Subtract group mean (reward reshaping)
    Step 2: Global batch normalization (stability)
    """
    # Step 1: Group mean subtraction
    after_group_sub = []
    for prompt_rewards in group_rewards:
        mean_g = sum(prompt_rewards) / len(prompt_rewards)
        # If all rewards equal (void sample), all advantages become 0 → no gradient
        after_group_sub.append([r - mean_g for r in prompt_rewards])
    
    # Flatten to global batch
    flat = [adv for group in after_group_sub for adv in group]
    
    # Step 2: Global normalization
    mean_b = sum(flat) / len(flat)
    std_b = (sum((x - mean_b)**2 for x in flat) / len(flat)) ** 0.5
    
    # Re-apply to grouped structure
    return [
        [(adv - mean_b) / (std_b + eps) for adv in group]
        for group in after_group_sub
    ]

Void sample filtering in action: If a group of $k = 4$ responses all get reward = 0 (none solved the problem), then after group-mean subtraction all advantages are $0 - 0 = 0$. These all-zero advantages contribute zero gradient to the PPO update, effectively filtering the uninformative sample from training. GRPO’s numerator is also zero in this exact case, but its denominator $\text{std}_\text{group} + \epsilon$ becomes tiny whenever a group’s rewards are nearly (not exactly) identical, which inflates small reward differences into large advantages. Dividing by the global batch std avoids this.
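The difference between dividing by a local and a global standard deviation is easiest to see on a group whose rewards are almost, but not exactly, identical. The sketch below uses made-up rewards and hypothetical helper names: local_norm follows the GRPO-style per-group z-score, and baseline_then_global follows the two-step scheme described above.

```python
from statistics import fmean, pstdev


def local_norm(groups, eps=1e-8):
    """GRPO-style: z-score each group with its own mean and std."""
    return [[(r - fmean(g)) / (pstdev(g) + eps) for r in g] for g in groups]


def baseline_then_global(groups, eps=1e-8):
    """w/Baseline-style: subtract group means, then divide by the batch std."""
    centered = [[r - fmean(g) for r in g] for g in groups]
    flat = [a for g in centered for a in g]
    mu, sd = fmean(flat), pstdev(flat)
    return [[(a - mu) / (sd + eps) for a in g] for g in centered]


groups = [
    [0.0, 0.0, 0.0, 0.0],          # void group: nobody solved the problem
    [0.500, 0.501, 0.500, 0.499],  # near-identical rewards (made up)
    [1.0, 0.0, 1.0, 0.0],          # informative group
]

local = local_norm(groups)
two_step = baseline_then_global(groups)
# Both schemes give the void group zero advantages (0 / eps is still 0).
# Local normalization stretches the 0.001 differences in the second group to
# roughly +/-1.4, the same order as the informative group; the two-step scheme
# keeps them near zero.
```

The near-identical group is the case to watch with continuous reward models, where exact ties are rare but tiny within-group spreads are common.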

Distributed Training Considerations

For multi-GPU training, global normalization requires computing statistics across all GPUs (all-reduce of the sum, sum of squares, and count before normalizing). This is a minor communication overhead (three scalars per step) compared to the much larger all-reduce operations needed for gradient synchronization. In practice:

# Distributed global normalization (sketch; assumes torch.distributed is initialized
# and advantages is this worker's flat tensor of raw advantages)
import torch
import torch.distributed as dist

eps = 1e-8

# Pack the three local statistics into one float64 tensor so a single all-reduce suffices
stats = torch.stack([
    advantages.double().sum(),
    (advantages.double() ** 2).sum(),
    torch.tensor(advantages.numel(), dtype=torch.float64, device=advantages.device),
])

# all_reduce works in place; it does not return the reduced tensor
dist.all_reduce(stats, op=dist.ReduceOp.SUM)
global_sum, global_sum_sq, global_count = stats

global_mean = global_sum / global_count
# E[x^2] - E[x]^2 can dip slightly below zero from rounding, so clamp at zero
global_var = (global_sum_sq / global_count - global_mean ** 2).clamp_min(0.0)
global_std = global_var.sqrt()

normalized = (advantages - global_mean.to(advantages.dtype)) / (global_std.to(advantages.dtype) + eps)

This all-reduce overhead is negligible (three scalars vs. millions of gradient parameters). The computation is closely analogous to the all-reduce used by synchronized Batch Normalization in distributed training, which is a well-solved engineering problem.
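The reduction can be checked without a GPU cluster. The following dependency-free sketch simulates three workers as plain Python lists (the shard contents and helper names are made up) and confirms that summing per-worker (sum, sum of squares, count) triples reproduces the statistics of the concatenated batch.

```python
from math import sqrt
from statistics import fmean, pstdev


def local_stats(shard):
    """Per-worker sufficient statistics: (sum, sum of squares, count)."""
    return (sum(shard), sum(x * x for x in shard), len(shard))


def merge_stats(all_stats):
    """Stand-in for an all-reduce with SUM: add the triples element-wise."""
    return tuple(sum(col) for col in zip(*all_stats))


# Made-up advantage shards held by three workers (unequal sizes on purpose)
shards = [[0.5, -1.0, 2.0], [1.5, 0.0], [-0.5, -2.5, 3.0, 1.0]]

total, total_sq, count = merge_stats([local_stats(s) for s in shards])
global_mean = total / count
# Clamp at zero: E[x^2] - E[x]^2 can dip below zero from rounding
global_std = sqrt(max(total_sq / count - global_mean ** 2, 0.0))

# Every worker normalizes its own shard with the shared statistics
normalized = [[(x - global_mean) / (global_std + 1e-8) for x in s] for s in shards]

# Reference: statistics computed directly on the concatenated batch
flat = [x for s in shards for x in s]
```

Reducing counts along with the sums matters when workers hold different numbers of tokens, since averaging per-worker means would weight short shards too heavily.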

Appendix C: Connection to Batch Normalization in Deep Learning

The global advantage normalization in REINFORCE++ is mathematically analogous to Batch Normalization (BN) in supervised learning (Ioffe & Szegedy, 2015). In BN:

$$\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}$$

where $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}^2$ are the mean and variance computed over the entire training mini-batch. BN was shown to dramatically stabilize neural network training by keeping activations in a well-normalized range, reducing sensitivity to initialization and learning rate.

REINFORCE++’s global advantage normalization plays the exact same role in policy gradient training: it keeps the advantage signal (which drives gradient updates) in a well-normalized range, reducing sensitivity to:

  • Absolute reward scale (a reward model trained on $[-5, 5]$ vs. $[0, 1]$ does not matter)
  • Prompt difficulty variation (easy and hard prompts produce comparable advantage magnitudes)
  • Reward drift during training (as the policy improves and reward distributions shift, the global normalization adapts automatically)
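The first bullet, invariance to the absolute reward scale, follows directly from z-scoring and is easy to verify. A minimal sketch with made-up rewards (the helper name is hypothetical): any positive affine rescaling of the rewards leaves the globally normalized advantages unchanged, up to the small eps term.

```python
from statistics import fmean, pstdev


def global_zscore(values, eps=1e-8):
    """Batch-level z-score, as in global advantage normalization."""
    mu, sd = fmean(values), pstdev(values)
    return [(v - mu) / (sd + eps) for v in values]


# Made-up rewards on a [0, 1] scale
rewards_01 = [0.1, 0.4, 0.9, 0.6, 0.2]
# The same preferences expressed on a [-5, 5] scale: r' = 10 * r - 5
rewards_pm5 = [10 * r - 5 for r in rewards_01]

adv_01 = global_zscore(rewards_01)
adv_pm5 = global_zscore(rewards_pm5)

# Largest element-wise difference between the two sets of advantages
max_gap = max(abs(a - b) for a, b in zip(adv_01, adv_pm5))
```

The same check would fail for a negative scale factor, which flips the sign of every advantage, so the invariance holds only for order-preserving rescalings.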

The key difference: in supervised BN, normalization is applied to intermediate activations; in REINFORCE++, it is applied to the advantage estimates (which serve as targets for the policy gradient update). Both achieve the same effect — stable, well-conditioned training signal.

This connection also explains why the effect is so pronounced: the advantage function in policy gradient is the direct analog of the loss gradient in supervised learning. Unstable advantages → unstable policy gradients → unstable training. Normalized advantages → stable policy gradients → stable training. This is not a coincidence; it is the same fundamental principle of training stability applied in two different learning paradigms.