- Review date: 2026-06-23
- Review author: Zhongzhu Zhou
- Paper reviewed: Advancing LLM Reasoning with Natural Language and Numerical Feedback
- Paper authors: Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng
- arXiv: 2506.03106
- Status/Venue: ICML 2026 Spotlight (43rd International Conference on Machine Learning, Seoul)
Short Answer
Critique-GRPO is an online reinforcement learning framework that augments the standard GRPO numerical reward signal with natural language critique feedback. It lets large language models learn simultaneously from their initial responses and from critique-guided refinements, and it targets three failure modes of purely scalar-reward RL training: performance plateaus, ineffective spontaneous self-reflection, and persistent failures.
Prerequisites: What You Need to Know First
Before diving into the method, let’s build the background knowledge a reader needs. I’ll cover policy gradient RL, the GRPO algorithm, and why natural language feedback matters.
1. Reinforcement Learning for Language Models: The Basic Setup
In standard supervised fine-tuning (SFT), a language model learns to imitate correct demonstrations. The loss is a cross-entropy between the model’s token distribution and the ground truth tokens. The problem is simple: you need lots of labeled examples, and the model only learns to copy, not to reason.
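Reinforcement learning (RL) for language models takes a different approach. Given a question $q$, the model (the policy $\pi_\theta$) generates a response $y$ and receives a reward $R(q, y)$ based on whether the answer is correct. The goal is to maximize expected reward:
$$J(\theta) = \mathbb{E}_{q \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid q)}\big[R(q, y)\big]$$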
In the RLHF (Reinforcement Learning from Human Feedback) literature, this reward can come from a trained reward model that scores human preferences, or from verifiable outcomes (e.g., does a math answer match the ground truth?). The latter is called RLVR (RL with Verifiable Rewards), and it has powered reasoning breakthroughs like DeepSeek-R1.
The fundamental challenge: reward signals in language generation are sparse and non-differentiable. You can’t backpropagate through is_correct(y) = True/False. This is where policy gradient methods come in.
2. Policy Gradient Theorem
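The core insight: even though $R(q, y)$ is non-differentiable with respect to $\theta$, the expected reward $J(\theta)$ is differentiable, because $\theta$ enters only through the sampling distribution. The policy gradient theorem, in its REINFORCE form (Williams, 1992), gives:
$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid q)}\big[R(q, y)\,\nabla_\theta \log \pi_\theta(y \mid q)\big]$$
Intuitively: if response $y$ gets high reward, push $\pi_\theta(y \mid q)$ higher; if it gets low reward, push it lower. The gradient of $\log \pi_\theta(y \mid q)$ tells us which direction to push.
For a sequence of tokens $y = (y_1, \dots, y_T)$, the log-likelihood decomposes as:
$$\log \pi_\theta(y \mid q) = \sum_{t=1}^{T} \log \pi_\theta(y_t \mid q, y_{<t})$$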
So the policy gradient becomes a sum over token-level gradients, weighted by the sequence reward. In practice, we use an advantage (reward minus a baseline) rather than raw reward to reduce variance.
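A minimal sketch of this estimator on a toy problem may help. It assumes a softmax policy over three candidate responses with fixed rewards; the function names and the toy setup are illustrative assumptions, not taken from the paper. The Monte Carlo average of reward times score function should approach the exact gradient of the expected reward.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, rewards):
    """Exact gradient of E[R] w.r.t. the logits of a softmax policy."""
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    # d/dz_j sum_i p_i R_i = p_j * (R_j - E[R])
    return [p * (r - expected_reward) for p, r in zip(probs, rewards)]


def reinforce_estimate(logits, rewards, num_samples, rng):
    """Monte Carlo estimate: average of R(y) * grad log pi(y)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        y = rng.choices(range(len(logits)), weights=probs)[0]
        for j in range(len(logits)):
            # grad_j log softmax(y) = 1[j == y] - p_j
            score = (1.0 if j == y else 0.0) - probs[j]
            grad[j] += rewards[y] * score / num_samples
    return grad
```

With logits `[0, 0, 0]` and rewards `[1, 0, 0]`, the exact gradient is positive for the rewarded response and negative for the other two, and the sampled estimate converges to it as the sample count grows.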
3. Proximal Policy Optimization (PPO) and Its Challenges
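PPO (Schulman et al., 2017) is a widely used policy gradient algorithm. It addresses the stability problem: large policy updates can destabilize training. PPO clips the probability ratio $r_t(\theta) = \pi_\theta(y_t \mid q, y_{<t}) / \pi_{\text{old}}(y_t \mid q, y_{<t})$ to prevent overly large steps:
$$J_{\text{PPO}}(\theta) = \mathbb{E}\Big[\min\big(r_t(\theta)\,\hat{A}_t,\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t\big)\Big]$$
The key challenge with PPO for LLMs is that computing the advantage $\hat{A}_t$ requires a value function: a separate neural network that estimates the expected future reward from each state. Training this value function is expensive (it roughly doubles the GPU memory) and often unstable for long token sequences.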
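The clipping rule is easy to see on single numbers. The sketch below computes the clipped surrogate for one token; the function name and the default `eps=0.2` (a commonly used value) are assumptions for illustration, not settings reported in the paper.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective for a single token.

    ratio: pi_theta(token) / pi_old(token)
    advantage: estimated advantage for that token
    eps: clip range (0.2 is an assumed, commonly used default)
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum makes the objective a pessimistic bound:
    # the policy gains nothing by moving the ratio past the clip range
    # in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a positive advantage, a ratio of 1.5 is capped at 1.2, so pushing the probability further earns no extra objective. For a negative advantage, a ratio of 0.5 is evaluated at 0.8, so the objective stops rewarding further suppression.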
4. Group Relative Policy Optimization (GRPO)
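GRPO (Shao et al., 2024) eliminates the value function by using a group of sampled responses to estimate advantages. For a given query $q$, GRPO samples $n$ responses $\{y^{(i)}\}_{i=1}^{n}$ from the old policy $\pi_{\text{old}}$ and obtains rewards $\{R^{(i)}\}_{i=1}^{n}$.
The advantage for each token $t$ in response $y^{(i)}$ is computed as:
$$\hat{A}^{(i)}_t = \frac{R^{(i)} - \text{mean}\big(\{R^{(j)}\}_{j=1}^{n}\big)}{\text{std}\big(\{R^{(j)}\}_{j=1}^{n}\big)}$$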
This normalizes rewards within the group, so responses that are better than the group average get positive advantages (reinforced), and below-average ones get negative advantages (suppressed).
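As a small sketch of the group normalization, the function below turns a list of rewards for one query into advantages. The name `group_advantages`, the use of the population standard deviation, and the small `eps` guard against division by zero are assumptions of this example; implementations differ on these details.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Uses the population standard deviation plus a small eps
    (both assumptions of this sketch) so that a group with
    identical rewards yields all-zero advantages instead of
    dividing by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With rewards `[1, 0, 0, 0]`, the single correct response receives a positive advantage and the three incorrect ones receive equal negative advantages. With rewards `[0, 0, 0, 0]`, every advantage is zero, so the group contributes no gradient.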
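The GRPO objective is:
$$J_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}\frac{1}{|y^{(i)}|}\sum_{t=1}^{|y^{(i)}|}\min\big(r^{(i)}_t(\theta)\,\hat{A}^{(i)}_t,\; \text{clip}(r^{(i)}_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}^{(i)}_t\big)\right] - \beta\, D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$$
where $r^{(i)}_t(\theta) = \pi_\theta(y^{(i)}_t \mid q, y^{(i)}_{<t}) / \pi_{\text{old}}(y^{(i)}_t \mid q, y^{(i)}_{<t})$ is the token-level probability ratio, the factor $1/|y^{(i)}|$ averages over token positions to avoid bias from response length, and the KL term regularizes the policy toward a reference model $\pi_{\text{ref}}$.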
GRPO eliminates the value network, making RL training far more memory-efficient. It underpins the DeepSeek-R1 and Qwen3 training pipelines.
5. Why Numerical Feedback Alone Falls Short
Here’s the key problem that this paper addresses. When rewards are binary (0 for wrong, 1 for right), the gradient signal is completely uninformative about where the model went wrong or how to correct it. The model gets “you’re wrong” with no actionable guidance.
Imagine a student getting back an exam with only a score—no comments, no red marks. They can’t improve without knowing what to fix.
Natural language feedback (NLF), such as a critique, says directly: “In Step 3, your geometric mean calculation assumed independence incorrectly. The correct approach is…” This carries far more information than a binary score: it localizes the error and points toward a fix.
Three Fundamental Limitations of Numerical-Only RL
The paper’s first major contribution is identifying and empirically documenting three failure modes of RL with purely numerical feedback.
Limitation 1: Performance Plateaus
Even with massive data scaling (4k → 32k training prompts, 8× increase) or extended training, performance stagnates. The authors train Qwen2.5-7B-Base with R1-GRPO and observe saturation after 120 steps regardless of data scale.
Why? Binary rewards provide uniform zero-information signal for all incorrect responses, regardless of whether they’re partially correct or completely wrong. Once the model has learned to solve all “easy” problems (those where some random samples happen to be correct), the gradient becomes near-zero for the hard tail.
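A back-of-the-envelope sketch shows how quickly the signal vanishes on hard questions. It assumes samples are independent and each is correct with probability `p_correct`; both the independence assumption and the function names are illustrative, not from the paper.

```python
def prob_all_fail(p_correct, group_size):
    """Probability that every sample in a group is wrong,
    assuming independent samples with success rate p_correct."""
    return (1.0 - p_correct) ** group_size


def centered_advantages(rewards):
    """Mean-centered rewards; all zeros when the rewards are identical."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

For a question the model solves 5% of the time, a group of 8 samples is all wrong with probability `0.95 ** 8`, about 66%. In each of those groups the centered advantages are all zero, so the question contributes nothing to the update.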
Limitation 2: Ineffective Spontaneous Self-Reflection
Does the model spontaneously start doing better reasoning under extended RL training? The authors categorize 6 cognitive behaviors:
- Planning behaviors: subgoal setting, summarization
- Self-reflection behaviors: verification, backtracking, backward chaining, anticipation
For problems solvable only by the RL-finetuned model (not the base model), self-reflection behaviors contribute minimally to success. The model learns what to do on easy problems through random exploration, but doesn’t develop principled error-correction abilities.
Limitation 3: Persistent Failures
Even the best RL-finetuned Qwen2.5-7B-Base consistently fails ~29% of training questions (Pass@4=0)—questions where even 4 attempts yield zero correct answers. These are systematically hard problems that the model never learns to handle.
Crucially, when you provide an explicit natural language critique to the failed model, it can successfully refine the solution 36.47% of the time (with CoT critique). This demonstrates that the model has the capability to correct itself, but the RL signal alone can’t activate it.
The Three Types of Critiques
The paper investigates three levels of natural language feedback, each providing progressively richer guidance:
┌─────────────────────────────────────────────────────────────┐
│ Critique Type 1: Indicative Critique │
│ "The generated solution is incorrect." │
│ → Binary failure signal only; no corrective guidance │
├─────────────────────────────────────────────────────────────┤
│ Critique Type 2: Indicative Critique w/ Ground Truth │
│ "The generated solution is incorrect, the ground │
│ truth is 7/25." │
│ → Failure signal + final answer; no step-level guidance │
├─────────────────────────────────────────────────────────────┤
│ Critique Type 3: CoT Critique │
│ "Let's analyze the student's solution step-by-step │
│ and identify any errors: │
│ ### Step 1: Geometry Understanding — [analysis] │
│ ### Step 7: Precise Calculation — [error identified] │
│ Conclusion: incorrect [END]" │
│ → Full chain-of-thought error localization + explanation │
└─────────────────────────────────────────────────────────────┘
Refinement success rates (Table 1, on 29.07% persistently failed questions):
- Indicative Critique: 2.09% valid refinements, 7.05% questions refined
- Indicative w/ GT: 1.98% valid refinements, 6.88% questions refined
- CoT Critique: 36.47% valid refinements, 55.37% questions refined
The CoT critique is far more effective: its valid-refinement rate (36.47%) is roughly 17 times that of either indicative variant (about 2%). Step-by-step error localization lets the model pinpoint and fix specific reasoning failures, a capability that indicative or answer-only feedback does not unlock.
Critique-GRPO: Method in Detail
flowchart LR
Q["Question q"] --> PM["Policy Model π_θ"]
PM --> Gen["Initial Responses\ny^(1)...y^(n)"]
Gen --> RS["Reward System\n(Rule or Model)"]
RS --> Crit["Critiques\nc^(1)...c^(n-1)"]
RS --> Rew["Scalar Rewards\nR^(1)...R^(n)"]
Gen --> GC["Group Compute\n(mean, std)"]
Rew --> GC
GC --> Adv["Weighted Advantages\nÂ_t^(i)"]
PM --> SelfRef["Self-Refinement\nvia Critiques y^(n)"]
SelfRef --> RS2["Reward Scoring\nR^refined"]
Adv --> PU["Policy Update\nJ_init + J_refi"]
RS2 --> PU
PU --> PM
Figure 1: Online Reinforcement Learning with Critique-GRPO. The model samples initial responses and refines them via in-context critique learning, combining both streams for policy optimization.
Step 1: Initial Response Sampling
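For each query $q$:
- Sample $n$ responses $\{y^{(i)}\}_{i=1}^{n}$ from the old policy $\pi_{\text{old}}$
- Evaluate each response with the reward system to get binary scores $R^{(i)} \in \{0, 1\}$ (1 for correct, 0 for incorrect)
- Generate a critique $c^{(i)}$ for each response using either:
  - Rule-based: Heuristically construct indicative critiques ($c_{\text{I}}$, or $c_{\text{GT}}$ with the ground truth appended)
  - Model-based: A reward model generates CoT critiques $c_{\text{CoT}}$ from the question, the response, and a critique instruction prompt
With model-based critiques, the binary correctness verdict that concludes the CoT critique determines the scalar reward: $R^{(i)} = 1$ if the critique judges the response correct, and $R^{(i)} = 0$ otherwise.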
Step 2: Critique-Guided Self-Refinement
This step activates only when the initial response set contains zero correct solutions (all $R^{(i)} = 0$). The motivation: if the model got at least one right, standard GRPO can already learn from the contrast; but when everything fails, critiques are needed to escape the zero-gradient trap.
Refinement generation: For each response $y^{(i')}$, generate a refined response $y^{(i')}_{\text{refined}}$ via in-context learning conditioned on the question-response-critique triplet:
$$y^{(i')}_{\text{refined}} \sim \pi_{\text{old}}\big(\cdot \mid I_{\text{refine}}, q, y^{(i')}, c^{(i')}\big)$$
where $i' = 1, \dots, n$ and $I_{\text{refine}}$ is a refinement instruction.
Diversity-preserving selection: From the full refinement set, sample a subset of $k$ refinements that prioritizes correct solutions (if available); if no correct refinements exist, sample randomly. This prevents degenerate behavior where the model only generates one type of refinement.
The final training group combines initial and refined responses:
$$\{y^{(i)}\}_{i=1}^{n} \cup \{y^{(i')}_{\text{refined}}\}_{i'=1}^{k}$$
Rewards for refined responses are computed by re-evaluating correctness.
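The selection and grouping logic of Step 2 can be sketched as follows. This is one possible reading of "prioritizes correct solutions": take correct refinements first in random order, then fill any remaining slots with random incorrect ones. The function names and the `(response, reward)` pair representation are assumptions of this example, not the authors' implementation.

```python
import random


def select_refinements(refined, k, rng):
    """Pick k refinements from a list of (response, reward) pairs,
    preferring correct ones (reward == 1). Remaining slots are
    filled with randomly ordered incorrect refinements."""
    correct = [pair for pair in refined if pair[1] == 1]
    incorrect = [pair for pair in refined if pair[1] != 1]
    rng.shuffle(correct)
    rng.shuffle(incorrect)
    return (correct + incorrect)[:k]


def build_training_group(initial, refined, k, rng):
    """Combine initial responses with selected refinements.

    Refinements are added only when every initial response failed,
    mirroring the activation condition described above."""
    if any(reward == 1 for _, reward in initial):
        return list(initial)
    return list(initial) + select_refinements(refined, k, rng)
```

With two failed initial responses and one correct refinement among two, `k=1` yields a training group of three items in which the only positive reward comes from the refinement.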
Step 3: Online Policy Optimization
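The combined training objective is:
$$J(\theta) = J_{\text{init}}(\theta) + J_{\text{refi}}(\theta)$$
The initial response objective follows standard GRPO, applying the clipped surrogate to each token of the $n$ initial responses (shown up to a constant normalization factor):
$$J_{\text{init}}(\theta) = \mathbb{E}\left[\sum_{i=1}^{n}\sum_{t=1}^{|y^{(i)}|}\min\big(r^{(i)}_t(\theta)\,\hat{A}^{(i)}_t,\; \text{clip}(r^{(i)}_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}^{(i)}_t\big)\right]$$
The refined response objective uses the same clipped PPO-style update over the $k$ selected refinements (again up to a constant normalization factor):
$$J_{\text{refi}}(\theta) = \mathbb{E}\left[\sum_{i'=1}^{k}\sum_{t=1}^{|y^{(i')}_{\text{refined}}|}\min\big(\rho^{(i')}_t(\theta)\,\hat{A}^{(i')}_t,\; \text{clip}(\rho^{(i')}_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}^{(i')}_t\big)\right]$$
Critical difference: The probability ratio for refined responses uses a policy shaping function $\rho^{(i')}_t(\theta)$ instead of the standard importance sampling ratio $r^{(i)}_t(\theta)$:
$$\rho^{(i')}_t(\theta) = \frac{\pi_\theta(y^{(i')}_t \mid q, y^{(i')}_{<t})}{\pi_\theta(y^{(i')}_t \mid q, y^{(i')}_{<t}) + \gamma}$$
Why policy shaping? Recall that refined responses are generated conditioned on the critique $c^{(i')}$, but the policy model during inference won’t receive critiques, so it must internalize the improvement. The shaping term $\gamma$ in the denominator effectively up-weights gradient contributions from tokens that are low-probability under the current policy. This ensures the model pays more attention to unfamiliar correction patterns in refinements, even when its initial likelihood of those sequences is very low.
The intuition is: without shaping, standard importance sampling would assign near-zero weight to refinement tokens that the current policy considers unlikely, effectively ignoring the critique guidance. The $\gamma$ in the denominator prevents this.
Advantage computation: The advantages $\hat{A}^{(i)}_t$ and $\hat{A}^{(i')}_t$ are computed using the group mean of rewards from both initial and refined sets, ensuring a unified baseline:
$$\hat{A}^{(i)}_t = R^{(i)} - \text{mean}\big(\{R^{(j)}\}_{j=1}^{n} \cup \{R^{(j')}_{\text{refined}}\}_{j'=1}^{k}\big)$$
with the same baseline subtracted from $R^{(i')}_{\text{refined}}$ to obtain $\hat{A}^{(i')}_t$.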
Note: the KL divergence penalty is removed (following Liu et al., 2025), and the length-normalization factor $1/|y^{(i)}|$ and reward standard deviation are also excluded to avoid biased gradients; details are discussed in the ablation.
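The two quantities that distinguish Step 3 from plain GRPO, the unified baseline and the shaped ratio, can be sketched in a few lines. The function names and the example value `gamma=0.1` are assumptions for illustration; the review does not state the value of the shaping factor used in the paper.

```python
def unified_advantages(initial_rewards, refined_rewards):
    """Subtract one shared baseline (mean over initial and refined
    rewards) from every reward. No division by the standard
    deviation, matching the note above."""
    all_rewards = list(initial_rewards) + list(refined_rewards)
    baseline = sum(all_rewards) / len(all_rewards)
    initial_adv = [r - baseline for r in initial_rewards]
    refined_adv = [r - baseline for r in refined_rewards]
    return initial_adv, refined_adv


def shaped_ratio(p_new, gamma):
    """Policy shaping ratio: p / (p + gamma), where p_new is the
    current policy's probability of the refinement token."""
    return p_new / (p_new + gamma)
```

With four failed initial responses and refinements rewarded `[1, 0]`, the baseline is 1/6. The correct refinement gets advantage 5/6 and every failed response gets -1/6, so the group now carries a non-zero learning signal.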
Algorithm Pseudocode
Algorithm: Critique-GRPO Training
Input: Query dataset Q, old policy π_old, reward system RS,
       group size n, refinement size k, shaping factor γ,
       clip ε, learning rate α, critique type C ∈ {I, GT, CoT}
Output: Updated policy π_θ
For each training iteration:
    For each batch of queries q from Q:
        // Step 1: Initial Response Sampling
        Sample {y^(1),...,y^(n)} ~ π_old(·|q)
        For each i = 1,...,n:
            Compute reward: R^(i) ← RS.score(q, y^(i))
            Generate critique: c^(i) ← RS.critique(q, y^(i), C)
            // c^(i) is null if C = I with no ground truth
        // Check if self-refinement is needed
        If all R^(i) = 0:
            // Step 2: Critique-Guided Self-Refinement
            For each i' = 1,...,n:
                Generate: y^(i')_refined ~ π_old(·|I_refine, q, y^(i'), c^(i'))
                Compute: R^(i')_refined ← RS.score(q, y^(i')_refined)
            // Priority sampling: prefer correct refinements
            Select k refinements {y^(i')_refined}_{i'=1}^k from full set,
                prioritizing those with R^(i')_refined = 1; if none, sample random
            // Combine training set
            TrainingGroup ← {y^(i)}_{i=1}^n ∪ {y^(i')_refined}_{i'=1}^k
        Else:
            TrainingGroup ← {y^(i)}_{i=1}^n
        // Step 3: Online Policy Optimization
        Compute unified baseline: B ← mean({R^(i)} ∪ {R^(i')_refined})
        For each y^(i) in TrainingGroup:
            Compute advantage: Â^(i)_t ← R^(i) - B (per token)
            If y^(i) is initial response:
                Ratio: r^(i)_t(θ) ← π_θ(y^(i)_t|q,y^(i)_{<t}) / π_old(y^(i)_t|q,y^(i)_{<t})
            Else (refined response):
                Ratio: ρ^(i')_t(θ) ← π_θ(y^(i')_t|q,y^(i')_{<t}) /
                    [π_θ(y^(i')_t|q,y^(i')_{<t}) + γ] // policy shaping
            Compute clipped objective: ℒ^(i) ← min(ratio × Â, clip(ratio, 1-ε, 1+ε) × Â)
        // Gradient update
        θ ← θ + α × ∇_θ [J_init(θ) + J_refi(θ)]
Theoretical Analysis: Complexity Reduction via Critique-Guided Exploration
The paper provides formal theoretical grounding via Proposition 4.1, which quantifies why critiques accelerate learning. Using the Transfer Eluder Dimension framework (Xu et al., 2025):
Setup: Consider a reasoning problem where the goal is to construct a hidden optimal solution $y^* = (s^*_1, \dots, s^*_L)$ of $L$ steps, with each step $s^*_l \in S$. The action space is $A = S^L$, so $|A| = |S|^L$, and the hypothesis space is $\mathcal{F}$.
flowchart TD
A["Hypothesis Space ℱ\nAll possible solution strategies"] --> B["Standard Generation\n(Reward-Only)"]
A --> C["Indicative Feedback\nc_I or c_GT"]
A --> D["Constructive Feedback\nc_CoT"]
B --> B1["Search Space: O(|S|^L)\nBinary reward provides zero\ninfo on error location"]
C --> C1["Restricted Space: A_c ⊂ A\nSearch pruned but complexity\nstill O(|S|^L) worst-case"]
D --> D1["Decomposed: L sub-problems of |S|\nStep-by-step localization\nreduces to O(|S|·L)"]
Figure 2: How different critique types reduce the hypothesis search space. Constructive (CoT) critiques achieve exponential compression by decomposing the L-step problem into L independent 1-step problems.
Standard Generation (binary reward only): Since binary rewards only indicate if the final state is correct, the agent must effectively enumerate the full action space. The Eluder dimension scales as $O(|S|^L)$.
Indicative Feedback ($c_{\text{I}}$, $c_{\text{GT}}$): The critique acts as a pruning signal: conditioning the policy on the failure and optionally the ground truth restricts search to a subspace $A_c \subset A$. However, worst-case complexity remains $O(|S|^L)$ since error location within the sequence is unknown.
Constructive Feedback ($c_{\text{CoT}}$): If the CoT critique localizes the first error to step $l$, the problem decomposes into $L$ independent sub-problems of size $|S|$. The hypothesis space for each step then has size $|S|$, reducing total search complexity from $O(|S|^L)$ to $O(|S| \cdot L)$.
Corollary (sample efficiency): For fixed computational budget where :
Critique-guided exploration yields exponentially higher probability of finding the correct solution compared to pure random sampling. This is the theoretical engine behind the empirical gains.
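To get a feel for the gap between the two regimes, the sketch below simply counts candidates. It is an illustrative counting exercise under the setup above, not a computation of the Eluder dimension, and the function names and the example values of `|S|` and `L` are assumptions.

```python
def reward_only_search_size(num_step_choices, num_steps):
    """Worst-case number of full solutions to try when only a
    final binary reward is observed: |S| ** L."""
    return num_step_choices ** num_steps


def stepwise_search_size(num_step_choices, num_steps):
    """Worst-case number of step candidates to try when feedback
    localizes errors to individual steps: |S| * L."""
    return num_step_choices * num_steps
```

With 10 choices per step and 5 steps, the reward-only search space has 100,000 candidates while the stepwise one has 50. Doubling the number of steps to 10 squares the first figure but only doubles the second.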
Experimental Setup
Training data: 4k randomly sampled examples from a reorganized 46k subset of OpenR1-Math-220k (Bakouch et al., 2025).
Validation: Curated validation set from Yan et al., 2025.
Models tested: Qwen2.5-7B-Base (non-reasoning), Qwen3-8B (reasoning w/ thinking), Qwen2.5-Math-7B-Base, Llama-3.2-3B-Instruct, Qwen3-32B (in appendix)
Benchmarks:
- In-distribution (ID) math: MATH-500, Minerva-MATH, OlympiadBench, AMC 2023, AIME 2024/2025
- Out-of-distribution (OOD) science & general: TheoremQA, GPQA-Diamond, MMLU-Pro
Baselines compared:
- Supervised Learning: SFT, RAFT (on correct responses), Refinement FT (on correct refinements), Critique FT (on CoT critiques), CITL-FT (initial + refinement data)
- RL-based: R1-GRPO (standard GRPO), R1-DrGRPO (GRPO without optimization bias terms)
Implementation: Asynchronous rollouts via the VERL framework (Sheng et al., 2024).
Results
Main Results: Critique-GRPO vs. All Baselines
xychart-beta
title "Average Pass@1 on 8 Reasoning Tasks (Qwen2.5-7B-Base)"
x-axis ["Base Model", "SFT", "RAFT", "Ref-FT", "CritFT", "CITL-FT", "R1-GRPO", "R1-DrGRPO", "C-GRPO (Ind)", "C-GRPO (GT)", "C-GRPO (CoT)"]
y-axis 30 --> 50
bar [32.04, 33.04, 34.27, 35.21, 34.76, 35.66, 41.18, 42.66, 44.62, 45.30, 47.08]
Figure 3: Critique-GRPO (all variants) consistently outperforms all supervised and RL-based baselines on Qwen2.5-7B-Base. CoT critique achieves 47.08% average Pass@1 vs. 41.18% for standard R1-GRPO (+5.9 points).
Key findings from Table 2:
| Method | Avg (Qwen2.5-7B-Base) | Avg (Qwen3-8B) |
|---|---|---|
| Base | 32.04 | 53.23 |
| R1-GRPO | 41.18 | 63.75 |
| R1-DrGRPO | 42.66 | 64.46 |
| C-GRPO (CoT) | 47.08 | 68.26 |
- Critique-GRPO does not require expert demonstrations (unlike supervised fine-tuning variants).
- Compared to CITL-FT (which uses both initial responses and critique-guided refinements but in an offline SFT regime), Critique-GRPO outperforms by +11.4 points (47.08% vs 35.66%) on Qwen2.5-7B-Base and +12.4 points (68.26% vs 55.84%) on Qwen3-8B. Online RL training is essential, not just the data mixture.
Data Efficiency: Only 4k Training Examples
A striking result comes from Table 3 (Qwen2.5-Math-7B-Base):
| Method | Training Data | MATH-500 | Avg |
|---|---|---|---|
| SimpleRL-Zero* | 46k | 76.00 | ~34.5 |
| PRIME-Zero* | 46k | 81.40 | ~34.0 |
| Oat-Zero* | 46k | 81.40 | ~41.0 |
| Critique-GRPO (CoT) | 4k | 84.20 | 51.06 |
Critique-GRPO achieves 84.20% on MATH-500 with only 4k prompts, 2.8 points above PRIME-Zero and Oat-Zero (both 81.40%), which use 46k prompts. This roughly 11× reduction in training prompts (46k vs. 4k) is consistent with the complexity reduction stated in Proposition 4.1.
Self-Improvement via Self-Critiquing
One of the most compelling results: Critique-GRPO enables a model to improve itself by generating its own critiques—no external critique model required.
xychart-beta
title "Pass@k on AIME 2024 (Qwen3-8B) — Self-Critique vs Baselines"
x-axis ["k=1", "k=2", "k=4", "k=8", "k=16", "k=32", "k=64", "k=128", "k=256"]
y-axis 40 --> 100
line [66.7, 70.0, 76.7, 80.0, 83.3, 86.7, 90.0, 93.3, 93.3]
line [40.0, 46.7, 53.3, 60.0, 63.3, 70.0, 73.3, 80.0, 80.0]
line [50.0, 53.3, 56.7, 63.3, 66.7, 70.0, 73.3, 80.0, 80.0]
Figure 4: Pass@k scaling curves on AIME 2024 for Critique-GRPO (self-critique, blue), R1-GRPO (green), and base Qwen3-8B (yellow). Critique-GRPO consistently outperforms across all k values from 1 to 256.
- Pass@1 on AIME 2024: 66.7% (Critique-GRPO self-critique) vs 40.0% (R1-GRPO)—a remarkable +26.7 percentage point improvement
- The gains hold across all k values, indicating genuine capability improvement, not just calibration
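For readers unfamiliar with Pass@k, a common way to compute it from `n` samples with `c` correct is the unbiased estimator popularized by the Codex evaluation work (Chen et al., 2021). Whether the paper uses exactly this estimator is not stated in this review, so treat the sketch as background rather than the authors' evaluation code.

```python
import math


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples with c correct.

    Returns the probability that at least one of k samples drawn
    without replacement from the n is correct:
    1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

A question with zero correct answers in four attempts gives `pass_at_k(4, 0, 4) == 0.0`, which is the Pass@4=0 condition used earlier to define persistent failures.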
Self-critiquing works because Critique-GRPO has internalized critique-based error correction. At inference time, the model can generate a critique of its own attempt and then refine. This bootstrapped self-improvement is emergent from online RL training.
Policy Entropy Dynamics
xychart-beta
title "Entropy Dynamics During Training (Qwen2.5-7B-Base)"
x-axis ["0", "50", "100", "150", "200", "250"]
y-axis 0 --> 2.2
line [1.5, 2.0, 1.8, 1.2, 0.7, 0.5]
line [1.5, 1.8, 1.2, 0.6, 0.3, 0.3]
line [1.5, 1.6, 1.0, 0.5, 0.3, 0.2]
Figure 5: Entropy dynamics comparison. Critique-GRPO (top curve) maintains higher policy entropy than R1-GRPO and R1-DrGRPO throughout training, indicating more sustained exploration. Early entropy peaks (steps 50-100) correspond to refinements that diverge significantly from initial responses.
Key entropy observations:
- Higher sustained entropy in Critique-GRPO corresponds to better exploration of rare but correct solution paths
- Early entropy spikes (around steps 50-100, matching Figure 5) arise when critique-guided refinements diverge sharply from initial responses. These divergences are productive because they explore high-advantage paths not reachable by standard generation
- The subsequent entropy decrease reflects the model rapidly internalizing the refined patterns
This is consistent with Cui et al. (2025b): rare, high-advantage actions increase policy entropy (promoting exploration), while common, high-advantage actions decrease it (consolidating gains).
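For reference, policy entropy here means the Shannon entropy of the next-token distribution, averaged over positions. The sketch below computes it in nats; the function names are illustrative, and the units and averaging scheme behind the plotted curves are not specified in this review.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(per_token_distributions):
    """Average entropy over a sequence of next-token distributions."""
    entropies = [token_entropy(d) for d in per_token_distributions]
    return sum(entropies) / len(entropies)
```

A deterministic distribution has entropy 0, while a uniform choice between two tokens has entropy `ln 2`, about 0.693 nats. A falling curve therefore means the policy is concentrating probability on fewer continuations.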
Fine-Grained Ablation Study (Table 6)
xychart-beta
title "Cumulative Gains from Critique-GRPO Components (Qwen2.5-7B-Base)"
x-axis ["R1-GRPO", "+KL Remove", "+Lang. Feedback", "+Quality Sel.", "+Policy Shaping"]
y-axis 38 --> 50
bar [41.18, 42.66, 43.26, 43.95, 47.08]
Figure 6: Ablation showing cumulative contribution of each Critique-GRPO component over the R1-GRPO baseline. Policy shaping contributes the largest single gain (+3.13 points).
| Modification | Avg Pass@1 | Δ from Prior |
|---|---|---|
| R1-GRPO | 41.18 | — |
| + Remove KL loss | 42.66 | +1.48 |
| + Language feedback (CoT critique) | 43.26 | +0.60 |
| + Quality-based refinement selection | 43.95 | +0.69 |
| + Policy shaping | 47.08 | +3.13 |
Interpretation: Each component contributes, but policy shaping provides the largest gain (+3.13 points). This makes sense: without shaping, the gradient signal from refinements would be diluted by the standard importance sampling ratio (near-zero for low-probability corrections). Policy shaping ensures high-advantage refinement tokens receive proportionate gradient attention regardless of their current probability under $\pi_\theta$.
Weak-to-Strong Generalization
Can a weaker model’s critiques improve a stronger model? Table 7 shows: yes.
Using Qwen3-8B-Base (weaker) to generate refinements for Qwen3-8B:
- Critique-GRPO (weaker refinement): 65.55% vs R1-GRPO: 63.75% (+1.8 points)
Even lower-quality critiques from a weaker teacher can guide a stronger model. This has practical implications: you don’t need a state-of-the-art critique model; a cheaper model can serve as the critique provider.
Online Joint Optimization vs. Sequential Baseline
Table 8 compares Critique-GRPO against a two-stage sequential baseline: (1) run R1-GRPO to convergence, then (2) SFT fine-tune on critique-generated refinements. Results:
| Method | MATH-500 | AMC23 | GPQA | Avg |
|---|---|---|---|---|
| R1-GRPO | 74.00 | 42.50 | 33.33 | 41.18 |
| R1-GRPO + Ref-SFT | 75.40 | 47.50 | 41.20 | 43.15 |
| Critique-GRPO | 77.80 | 62.50 | 37.88 | 47.08 |
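Critique-GRPO outperforms the sequential approach on average (47.08 vs. 43.15) and by a wide margin on competition math (AMC23: +15 points), although the sequential baseline scores higher on the OOD benchmark GPQA (41.20 vs. 37.88). The results suggest that online joint optimization matters: learning simultaneously from initial responses and refinements yields gains that staged training does not fully replicate.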
Critical Assessment: Weaknesses & Improvements
Having reviewed the method and results thoroughly, I want to offer a candid critical assessment of what the paper does well and where it falls short.
Weaknesses & Flaws
1. Limited Scale of Policy Models
The main results are on 3B and 7-8B parameter models. In 2026, frontier models operate at 70B-700B parameters where training dynamics differ substantially. The performance gains at 7B may not translate—at larger scale, standard GRPO training sees different saturation patterns, and the computational overhead of generating critiques at inference time during training could become prohibitive. The paper’s Table 5 does test Llama-3.2-3B and Qwen3-32B, but doesn’t systematically study how gains scale with model size.
2. CoT Critique Model Dependency
The strongest results use CoT critiques from GPT-4o (Table 5). When using weaker open-source critique models (Llama-3.1-405B: 46.79%, DeepCritic-7B: 47.98% on Qwen2.5-7B-Base), gains are reduced. The dependency on critique quality creates a circular problem: generating high-quality CoT critiques requires a capable model, but you’re trying to improve a less-capable model. In practical deployments where you can’t use GPT-4o for every training step, the gains may be smaller.
3. Evaluation Coverage is Narrow
All eight benchmarks are mathematical or STEM reasoning tasks. The claim that Critique-GRPO improves “complex reasoning capabilities” is not tested on:
- Code generation (where critique-guided refinement would be natural, via compile errors or test failures)
- Multi-hop QA beyond GPQA
- Open-ended tasks where correctness verification is non-trivial
- Tasks requiring multi-step tool use
The method may be very domain-specific to math reasoning, where binary correctness is easy to verify.
4. Compute Cost Not Fully Disclosed
Generating CoT critiques and running self-refinement adds significant computational overhead to each training step. Table 5 in Appendix (mentioned but not fully reproduced in main paper) discusses costs, but the main paper provides no wall-clock comparison. A 2×-3× training time overhead would significantly change the efficiency story, especially when compared against “just use more data” alternatives.
5. Persistent Failure Cases Still Exist
Table 1 shows CoT critique enables refinement of 55.37% of persistently failed questions—which means 44.63% of the hardest questions remain unsolved even with CoT critique. The paper doesn’t discuss what characterizes these truly intractable cases or whether critique quality (not just type) is the bottleneck. Are they hard because the concept is missing from training data? Because the critique itself is wrong? This analysis gap means practitioners can’t predict when Critique-GRPO will help.
Limitations the Authors Understate
Reward Model Reliability: As described in Step 1, the binary correctness verdict of the CoT critique determines the scalar reward $R^{(i)}$. This means reward quality is fully delegated to the critique model. If the critique model makes errors (marks correct solutions as incorrect, or vice versa), those errors propagate directly into the policy gradient. The paper does not report how often this happens or how it affects training stability.
Refinement Correctness ≠ Training Signal Quality: A valid refinement (one that produces a correct final answer) may still exhibit incorrect reasoning steps. The reward signal is outcome-based (+1 for correct final answer), but the refinement trajectories being trained on may contain flawed intermediate steps. This creates a subtle alignment problem: the model learns to produce refinements that eventually reach correct answers, but not necessarily through valid reasoning chains.
Distribution Shift from Conditioning on Critiques: Self-refinement is trained conditioned on critiques $c^{(i)}$, but during pure inference (Section 5.4, self-critique scenario), the model generates its own critique before refining. There’s an implicit assumption that self-generated critiques during inference have similar quality to training-time critiques. This is not formally verified.
Concrete Improvement Suggestions
1. Ablate Critique Quality vs. Critique Type: The paper compares three types (Indicative, GT, CoT) but doesn’t ablate critique quality within CoT critiques. An experiment with different critique model sizes (3B, 7B, 70B, 405B) generating CoT critiques would clarify whether quality or structure matters more.
2. Test on Code Generation: The method should naturally transfer to code with compiler errors or test-case failures as critique signals. A systematic evaluation on HumanEval/MBPP/SWE-bench would validate the claimed generality.
3. Report Training Throughput: Publishing tokens/second or GPU-hours per checkpoint for Critique-GRPO vs. R1-GRPO would allow practitioners to make informed decisions. If Critique-GRPO requires 3× training time, then a fair comparison against R1-GRPO with 3× more steps may narrow the gap.
4. Study Critique Error Rate: Track how often the critique model incorrectly labels correct solutions as incorrect, and measure the effect of this noise on final policy quality. A curriculum that starts with reliable (rule-based) critiques and gradually transitions to more expressive (CoT) critiques could be more robust.
5. Long-horizon Reasoning: Extend the evaluation to multi-step agentic tasks (e.g., tool use, sequential planning) where the critique must identify errors across a much longer trajectory. This would test whether the complexity reduction of Proposition 4.1 holds in the large-$L$ regime.
The Six Cognitive Behaviors: What RL Actually Learns
Section 3.1 of the paper includes a fascinating behavioral analysis that deserves more attention. The authors categorize six cognitive behaviors emerging during RL fine-tuning:
Planning behaviors (help produce correct solutions):
- Subgoal setting: decomposing the problem into smaller sub-tasks before solving
- Summarization: periodically reviewing progress and key facts
Self-reflection behaviors (expected to help, but in practice contribute little):
- Verification: checking if a computed answer satisfies the problem constraints
- Backtracking: abandoning a failed reasoning path and starting over
- Backward chaining: starting from the goal state and working backwards
- Anticipation: predicting likely errors before they occur
The surprising finding: for problems that only the RL-finetuned model can solve (not the base model), self-reflection behaviors contribute minimally to success. The model doesn’t learn to use these behaviors strategically—it learns to use them superficially.
Why? With binary rewards, there’s no signal about when verification is needed or which backtracking path to pursue. The model may learn to insert verification-like tokens without any real computation backing them. This is the RL equivalent of a student writing “Let me double-check…” but then just copying the same answer.
Critique-GRPO addresses this by providing explicit guidance: the critique says “Step 3 is wrong, here’s why.” This gives the model a concrete target for its self-reflection—backtracking to step 3 and recomputing it becomes a reward-maximizing strategy, not just a cosmetic behavior.
Connecting the Dots: Why Online Joint Training Matters So Much
A key insight from the results deserves deeper analysis. The paper demonstrates that online joint optimization substantially outperforms a sequential approach (first run GRPO, then SFT on critique-refined data). Why is this?
Consider what each approach teaches the model:
Sequential approach (R1-GRPO → Refinement-SFT):
- GRPO converges on the current model’s natural generation distribution. It learns to distinguish “better vs. worse” responses from the random exploration of $\pi_\theta$.
- After GRPO, you collect refinements from the converged model and fine-tune. But the refinements are designed to fix the failures of that specific converged model—they don’t interact with the RL optimization dynamics.
Online joint approach (Critique-GRPO):
- The critique-guided refinements inject targeted high-quality samples into the RL loop before the model fully converges.
- Refinements provide positive reward signals for exactly the questions where pure exploration fails.
- The policy gradient from refinements shapes the model to explore correction strategies it would never discover through random sampling.
- As the policy improves, it generates better initial responses, which changes the distribution of needed refinements, creating a positive feedback loop.
This is fundamentally different from staged training. The interaction between initial generation learning and refinement learning—both happening simultaneously—creates synergistic dynamics. The policy simultaneously becomes better at generating initial attempts and internalizing the correction patterns that critiques reveal. Neither stage alone achieves this.
The Role of Policy Shaping: A Deeper Look
The policy shaping function deserves more detailed analysis since it’s the highest-impact component in the ablation (+3.1 points).
The Standard Importance Sampling Problem
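When optimizing on refinements with standard importance sampling:
$$r_t(\theta) = \frac{\pi_\theta(y_t \mid q, y_{<t})}{\pi_{\text{old}}(y_t \mid q, y_{<t})}$$
The problem arises when $\pi_\theta$ assigns very low probability to a refinement token: $\pi_\theta(y_t \mid q, y_{<t}) \to 0$.
In this case, $r_t(\theta) \to 0$, and the gradient contribution is:
$$\nabla_\theta\big(r_t(\theta)\,\hat{A}_t\big) = r_t(\theta)\,\hat{A}_t\,\nabla_\theta \log \pi_\theta(y_t \mid q, y_{<t}) \approx 0$$
Even if $\hat{A}_t$ is large and positive (the refinement is much better than average), the near-zero importance weight kills the gradient. The model never learns from high-quality but low-probability corrections.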
How Policy Shaping Fixes This
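The shaping function:

$$f(\pi_\theta) = \frac{\pi_\theta}{\pi_\theta + \gamma}$$

Here $\pi_\theta$ is shorthand for the current policy’s probability of the refinement token, and $\gamma$ is the policy shaping coefficient.
As $\pi_\theta \to 0$: $f \to 0$, but at a much slower rate than the standard ratio. The $\gamma$ prevents the denominator from collapsing to zero.
As $\pi_\theta \to 1$: $f \to \frac{1}{1+\gamma}$.
This function maps the probability into the bounded range $[0, \frac{1}{1+\gamma}]$, where low-probability tokens still receive non-trivial weights. In effect, the model cannot “escape” from learning rare but valuable correction patterns by simply assigning them low probability.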
The intuition parallels curriculum learning: you want the policy to pay attention to unfamiliar examples even when they seem foreign. Standard IS would let the model ignore them; policy shaping keeps them visible.
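To make the contrast concrete, here is a minimal numerical sketch. It assumes the shaping function has the form f(p) = p / (p + gamma) and, for simplicity, fixes the old-policy probability at 1 so the standard ratio reduces to p. The value gamma = 0.1 and the function names are illustrative assumptions, not settings taken from the paper. Both functions return the coefficient that multiplies the gradient of log p for a single token.

```python
def standard_weight(p, p_old=1.0):
    """Coefficient on grad log p under plain importance sampling: r = p / p_old."""
    return p / p_old

def shaped_weight(p, gamma=0.1):
    """Coefficient on grad log p under f(p) = p / (p + gamma).

    df/dp = gamma / (p + gamma)**2 and dp = p * d(log p),
    so the coefficient is gamma * p / (p + gamma)**2.
    """
    return gamma * p / (p + gamma) ** 2

for p in (0.001, 0.01, 0.1, 0.9):
    print(f"p={p:<6} standard={standard_weight(p):.4f} shaped={shaped_weight(p):.4f}")
```

Under these assumptions, rare tokens receive a several-fold larger coefficient than under the plain ratio, while tokens the policy already favors are damped.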
Comparison with KL Penalty
The standard alternative to policy shaping would be to add a KL divergence penalty to the objective. But KL penalizes any deviation from the old policy, including beneficial deviations toward high-quality refinements. The ablation confirms this: removing the KL term helps (+1.48 points). KL is counterproductive when you want the policy to diverge from its current distribution toward critique-guided improvements.
Policy shaping is subtler: it specifically amplifies gradients from low-probability (novel) corrections without broadly penalizing divergence. This targeted mechanism is why it outperforms the KL approach.
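The indiscriminate nature of a KL penalty can be seen in a small sketch with made-up three-token distributions (all numbers are hypothetical). The penalty grows whenever the policy moves away from the reference, and it cannot tell a move toward a correct refinement token from an equally sized move toward a wrong one.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Token order: [habitual wrong token, correct refinement token, other wrong token]
reference = [0.90, 0.05, 0.05]
toward_correct = [0.60, 0.35, 0.05]  # shifts mass onto the correct token
toward_wrong = [0.60, 0.05, 0.35]    # shifts the same mass onto a wrong token

kl_correct = kl_divergence(toward_correct, reference)
kl_wrong = kl_divergence(toward_wrong, reference)
print(kl_correct, kl_wrong)  # the two penalties coincide
```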
Comparison with Related Methods
To fully appreciate where Critique-GRPO fits, let’s compare it with the most relevant prior work.
flowchart TD
A["RL for LLM Reasoning"] --> B["Numerical Only\n(GRPO, PPO, REINFORCE++)"]
A --> C["Supervised Critique\n(CITL-FT, Critique FT)"]
A --> D["Online + NLF\n(Critique-GRPO)"]
A --> E["Expert Demo-Guided\n(R1-DrGRPO, RAFT)"]
B --> B1["✓ No expert demos\n✗ Performance plateau\n✗ No error guidance"]
C --> C1["✓ Learns from critique\n✗ Offline only\n✗ Needs expert demos\n✗ No RL exploration"]
D --> D1["✓ No expert demos\n✓ Online RL dynamics\n✓ NLF error guidance\n✓ Self-improvement\n✗ Compute overhead"]
E --> E1["✓ Strong performance\n✗ Requires expert demos\n✗ Limited exploration"]
Figure 7: Positioning Critique-GRPO in the RL-for-LLM landscape. It occupies the unique quadrant of online RL + natural language feedback without expert demonstrations.
vs. REINFORCE++/GRPO variants (DAPO, VAPO, GSPO): These all operate with purely numerical rewards and differ in advantage normalization or clipping strategies. They address training stability, but not the fundamental expressiveness limitation of binary feedback.
vs. CITL-FT (Critique-in-the-Loop Fine-Tuning): CITL-FT uses both initial responses and critique-guided refinements as training data—but in an offline supervised setting. The policy gradient dynamics are absent; the model learns from a static dataset rather than adapting online. This explains the +11.4 point gap between Critique-GRPO and CITL-FT.
vs. Self-Refine / Reflexion: These inference-time methods use critiques to iteratively improve outputs at test time but don’t update model weights. Critique-GRPO uses critiques for weight updates, permanently improving the model rather than just improving a single inference chain.
vs. Process Reward Models (PRM): PRMs provide dense intermediate rewards at each reasoning step—similar to CoT critique feedback. However, PRMs require labeled process data (step-level annotations), which is expensive. Critique-GRPO’s model-based critiques are fully automated.
Implementation Details Worth Noting
VERL Framework for Asynchronous Rollouts
The paper implements Critique-GRPO via the VERL framework (verl, short for Volcano Engine Reinforcement Learning for LLMs), which supports asynchronous rollout generation. This is important because:
- Standard synchronous rollouts: Policy model generates all responses, waits for all of them to complete, then does a gradient update. GPUs sit idle during the long generation phase.
- Asynchronous rollouts: New training batches start before all responses from the previous batch are generated. This dramatically improves GPU utilization when generation is the bottleneck (which it is for LLMs).
For Critique-GRPO specifically, the two-stage generation (initial responses + critique-guided refinements) would be particularly wasteful with synchronous rollouts. VERL’s asynchronous design makes the critique-refinement loop computationally tractable.
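The scheduling idea can be sketched with Python's `asyncio`. This is a toy under stated assumptions, not verl's API: `generate`, `critique_and_refine`, and the sleep-based latencies are hypothetical stand-ins for model calls. Each question's refinement stage starts as soon as its own initial response finishes, instead of waiting for the slowest question in the batch.

```python
import asyncio

async def generate(question: str) -> str:
    """Stand-in for sampling an initial response (latency is simulated)."""
    await asyncio.sleep(0.01 * len(question))
    return f"draft({question})"

async def critique_and_refine(question: str, draft: str) -> str:
    """Stand-in for the critique plus refinement stage."""
    await asyncio.sleep(0.01)
    return f"refined({draft})"

async def two_stage(question: str) -> tuple[str, str]:
    # Stage 2 begins when this question's stage 1 is done,
    # regardless of slower questions in the same batch.
    draft = await generate(question)
    refined = await critique_and_refine(question, draft)
    return draft, refined

async def run_batch(questions: list[str]) -> list[tuple[str, str]]:
    # gather preserves input order even though tasks finish at different times.
    return await asyncio.gather(*(two_stage(q) for q in questions))

results = asyncio.run(run_batch(["q1", "a longer question", "q3"]))
print(results)
```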
Critique Instruction Prompts
The paper mentions critique instruction and refinement instruction but doesn’t provide them in the main text (they’re in Appendix M/E). Key design principles:
- The critique instruction must direct the critique model to evaluate the response step-by-step and identify the first error location
- The refinement instruction must direct the policy to re-solve the problem given the original attempt and its critique
- The critique must conclude with `incorrect [END]` or `correct [END]` so the reward system can extract a binary label
The binary label extraction from CoT critique outputs is where reward model reliability matters—a malformed critique output could produce an incorrect label.
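A minimal sketch of such a label extractor, assuming the verdict format described above. The function name `extract_label` and the choice to return `None` on malformed output are illustrative assumptions; the paper's actual parser is not shown in this review. One subtlety worth handling is that the string "incorrect" contains "correct".

```python
import re
from typing import Optional

# Match the verdict only at the very end of the critique.
# The \b keeps the tail of "incorrect" from being read as "correct".
_VERDICT = re.compile(r"\b(incorrect|correct)\s*\[END\]\s*$", re.IGNORECASE)

def extract_label(critique: str) -> Optional[int]:
    """Return 1 for 'correct [END]', 0 for 'incorrect [END]', None if malformed."""
    match = _VERDICT.search(critique)
    if match is None:
        return None
    return 1 if match.group(1).lower() == "correct" else 0

print(extract_label("Step 3 drops a sign. Conclusion: incorrect [END]"))
print(extract_label("All steps check out. Conclusion: correct [END]"))
print(extract_label("The answer looks correct."))
```

Returning `None` rather than defaulting to 0 or 1 lets the caller discard or retry a malformed critique instead of silently assigning a reward.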
Hyperparameter Sensitivity
Key hyperparameters introduced by Critique-GRPO:
- Refinement size: number of refinements to sample (balances exploration breadth vs. computational cost)
- Policy shaping factor: controls how aggressively to upweight low-probability refinements
- Refinement trigger threshold: here “all ” is binary; a soft threshold or probability-based trigger could be worth exploring
The ablation validates the choices made, but a systematic hyperparameter sensitivity analysis (what happens as each of these varies?) would be valuable for practitioners.
Conclusion
Critique-GRPO represents a principled step beyond binary reward RL for language models. By integrating natural language critiques into the online RL loop—not just as post-hoc explanations but as training signals—it addresses three structural failure modes that limit GRPO: performance plateaus, failed self-reflection, and persistent failures.
The theoretical grounding via complexity reduction is genuine: CoT critiques decompose exponential search into linear search, and the empirical results reflect this. A +16.7% gain on AIME 2024 via self-critiquing, and state-of-the-art results on Qwen2.5-Math-7B-Base with only 4k training samples, are compelling evidence that the approach works.
That said, the evaluation is narrow (math/STEM only), compute costs are underreported, and the critique model dependency creates practical challenges. For practitioners, the main takeaway is: if you have a reliable critique source (rule-based or model-based), integrating it into online RL training can yield substantial gains over scalar-reward-only approaches. The policy shaping mechanism for refinements is the key engineering contribution that makes this work in practice.
Benchmark-by-Benchmark Analysis
To fully understand where Critique-GRPO gains and where gaps remain, let’s dissect Table 2 task by task (Qwen2.5-7B-Base, CoT Critique vs R1-GRPO):
| Benchmark | R1-GRPO | Critique-GRPO | Δ | Notes |
|---|---|---|---|---|
| MATH-500 | 74.00 | 77.80 | +3.8 | Moderate gain; easier problems where GRPO already works well |
| Minerva-MATH | 32.00 | 36.80 | +4.8 | Scientific math; critique useful for formula errors |
| OlympiadBench | 38.50 | 42.40 | +3.9 | Competition math; harder problems |
| AMC 2023 | 42.50 | 62.50 | +20.0 | Very large gain; multiple-choice with known distractors |
| AIME 2024 | 16.70 | 20.00 | +3.3 | Hard competition; absolute scores low but meaningful |
| TheoremQA | 40.60 | 44.00 | +3.4 | Cross-domain science; critique helps with domain transfer |
| GPQA-Diamond | 33.33 | 37.88 | +4.6 | Graduate-level science; significant OOD improvement |
| MMLU-Pro | 51.81 | 55.28 | +3.5 | General reasoning; consistent uplift |
The most striking single number: AMC 2023 jumps from 42.5% to 62.5% (+20 points). AMC problems have structured multiple-choice answers, which may make critique generation particularly effective—the model can compare its computed answer against the provided choices as an additional feedback signal. This hints that structured evaluation contexts may amplify Critique-GRPO’s benefits.
Outside AMC 2023, the gains fall in a narrow band of +3.3 to +4.8 points. The smallest are on AIME 2024 (+3.3), TheoremQA (+3.4), and MMLU-Pro (+3.5), with MATH-500 (+3.8) close behind. The two ends of the math difficulty range are suggestive: MATH-500 problems are well within GRPO’s reach, so adding critique doesn’t help much on things the model can already solve through exploration, while AIME problems are at the frontier of model capability, where even critiques may not provide enough guidance to escape persistent failure.
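The deltas and averages can be recomputed directly from the table. The snippet below only copies the scores listed above and does the arithmetic; the variable names are arbitrary.

```python
# Scores copied from the table above: (R1-GRPO, Critique-GRPO)
scores = {
    "MATH-500": (74.00, 77.80),
    "Minerva-MATH": (32.00, 36.80),
    "OlympiadBench": (38.50, 42.40),
    "AMC 2023": (42.50, 62.50),
    "AIME 2024": (16.70, 20.00),
    "TheoremQA": (40.60, 44.00),
    "GPQA-Diamond": (33.33, 37.88),
    "MMLU-Pro": (51.81, 55.28),
}

# Per-benchmark gain in points, rounded to avoid float noise.
deltas = {name: round(new - old, 2) for name, (old, new) in scores.items()}
avg_baseline = sum(old for old, _ in scores.values()) / len(scores)
avg_critique = sum(new for _, new in scores.values()) / len(scores)

# Print benchmarks from smallest to largest gain, then the two averages.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name:<14} {delta:+.2f}")
print(f"average: {avg_baseline:.2f} -> {avg_critique:.2f}")
```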
Scaling Behavior: Does More Data Help?
One underappreciated result: Critique-GRPO achieves 47.08% average Pass@1 on Qwen2.5-7B-Base with just 4k training examples. The numerical feedback baselines (SimpleRL-Zero, PRIME-Zero) use 46k examples and achieve 34-41% on the weaker math-specialized Qwen2.5-Math-7B-Base model.
This suggests Critique-GRPO has fundamentally better data efficiency, not just marginally better, although the comparison spans different base models and should be read with that in mind. The theoretical argument from Proposition 4.1 predicts exactly this: CoT critiques reduce the effective search space from an exponential search to a linear one, meaning each training example does exponentially more work.
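A toy model of the exponential-versus-linear gap, under assumptions made for illustration rather than taken from the paper: a solution is a sequence of T steps with b candidate choices each, outcome-only feedback says only whether the whole sequence is right, and critique feedback reveals the index of the first wrong step. The function names are hypothetical, and this illustrates the intuition only; it is not a restatement of Proposition 4.1.

```python
from itertools import product

def outcome_only_attempts(target, b):
    """Enumerate whole sequences until the binary check passes; return attempts used."""
    for attempts, guess in enumerate(product(range(b), repeat=len(target)), start=1):
        if list(guess) == target:
            return attempts
    raise ValueError("target is outside the search space")

def first_error_attempts(target, b):
    """Repair one step at a time using 'first wrong step' feedback; return attempts used."""
    guess = [0] * len(target)
    attempts = 0
    while True:
        attempts += 1
        # Simulated critique: index of the first incorrect step, or None if all are right.
        first_error = next(
            (i for i, (g, t) in enumerate(zip(guess, target)) if g != t), None
        )
        if first_error is None:
            return attempts
        # Only the flagged step changes; earlier steps are known to be correct.
        guess[first_error] = (guess[first_error] + 1) % b

T, b = 6, 3
target = [b - 1] * T  # worst case: the last candidate at every step
print(outcome_only_attempts(target, b))  # on the order of b**T
print(first_error_attempts(target, b))   # on the order of b*T
```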
A natural follow-up question: does Critique-GRPO also scale better with more data? If we scale from 4k to 46k training prompts, does it maintain its lead? The paper doesn’t test this, but the complexity reduction argument suggests yes: the advantage is structural (exploration efficiency), not specific to low-data regimes.
Connection to the Broader Literature on Process Supervision
Critique-GRPO sits at an interesting intersection with process reward models (PRM) and outcome reward models (ORM):
- ORM (what standard GRPO uses): reward only at the final step; based on final answer correctness.
- PRM: intermediate rewards at each step, requiring step-level labels—expensive but informative.
- Critique-GRPO: CoT critique provides pseudo-step-level information (which step has the first error) without step-level labels. It bridges the information richness of PRM and the annotation efficiency of ORM.
This is a key insight: you don’t need manual step-level labels to get process-level guidance. A language model that can generate step-by-step critiques provides approximately the same information at a much lower annotation cost.
The limitation is that CoT critique correctness is only as reliable as the critique model—which may itself make step-level errors. The paper treats this as a solved problem by evaluating only final answer correctness, but the step-level quality of CoT critiques is an open question that future work should address.
My Take
The paper’s central insight—that natural language feedback can break the sample complexity bottleneck of binary reward RL—is theoretically sound and empirically well-supported. The three-failure-mode framing is honest and useful.
What I find most practically interesting is the self-critique result: a model trained with Critique-GRPO can generate and learn from its own critiques, with no external annotator or reward model. This is a meaningful step toward self-supervised improvement loops that could enable continuous autonomous learning post-deployment.
The AMC 2023 result (+20 points over GRPO) is particularly worth studying in follow-up work—it suggests structured problems with measurable intermediate states may see disproportionate benefit from critique-guided RL. Understanding why AMC benefits so much more than MATH-500 could unlock new design principles for curriculum construction in RL training.
The gaps I’d push hardest on in the next iteration: (1) scale to 70B+ models to see if the gains hold, (2) systematic evaluation on code/tool-use tasks, and (3) a rigorous treatment of critique reliability and its effect on training dynamics. Given the ICML Spotlight reception, this work will likely catalyze follow-ups on NLF-augmented RL across a wider range of tasks.
Reproducibility Notes
The paper releases code and models at https://github.com/zhangxy-2019/critique-GRPO. For readers wanting to reproduce the main results:
- Training framework: VERL (verl, Volcano Engine Reinforcement Learning for LLMs)
- Base models: Qwen2.5-7B-Base, Qwen3-8B (via HuggingFace)
- Training data: OpenR1-Math-220k (46k subset, 4k randomly sampled)
- Critique model: GPT-4o for model-based CoT critiques; rule-based for Indicative and GT variants
- Key hyperparameters: group size , refinement size not explicitly stated in main paper, clip (standard GRPO), sampling temperature 0.7
The asynchronous rollout implementation via VERL is essential for computational efficiency—a synchronous implementation would be significantly slower due to the two-stage generation pipeline.
Appendix: Key Notation Reference
For readers who want to implement or extend this work, here is a consolidated reference table for all key symbols:
| Symbol | Meaning |
|---|---|
| $q$ | Input question/query |
| $\pi_\theta$ | Policy model with parameters $\theta$ |
| $\pi_{\text{old}}$ | Reference (old) policy used for importance sampling |
| $y^{(i)}$ | The $i$-th sampled initial response |
| $y^{(i)}_{\text{refined}}$ | The $i$-th critique-guided refined response |
| $c^{(i)}$ | Natural language critique for $y^{(i)}$ |
| $R^{(i)}$ | Binary reward: 1 if correct, 0 if incorrect |
| $A_{i,t}$ | Normalized advantage at token position $t$ of response $i$ |
| $r_{i,t}(\theta)$ | Standard importance sampling ratio for initial responses |
| $f(\cdot)$ | Policy-shaped ratio for refined responses |
| $\gamma$ | Policy shaping coefficient |
| $\epsilon$ | PPO clipping parameter |
| $n$ | Group size (number of initial responses per query) |
| $k$ | Refinement size (number of refined responses per query) |
| $I_c$ | Critique instruction prompt |
| $I_{\text{refine}}$ | Refinement instruction prompt |
| $\mathcal{J}_{\text{initial}}(\theta)$ | GRPO objective on initial responses |
| $\mathcal{J}_{\text{refined}}(\theta)$ | GRPO objective on refined responses |
| $\mathcal{J}_{\text{Critique-GRPO}}(\theta)$ | Total Critique-GRPO objective |