May 31, 2026 EN #Reinforcement Learning #LLM Training #Reasoning

Group Sequence Policy Optimization: A Sequence-Level RL Algorithm for Training Large Language Models

Review date: 2026-05-31
Review author: Zhongzhu Zhou
Paper reviewed: Group Sequence Policy Optimization
Paper authors: Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin
arXiv: 2507.18071v2, 2025-07-28
Venue/status: Technical report, Alibaba Qwen Team

Short Answer

This paper introduces Group Sequence Policy Optimization (GSPO), a reinforcement learning algorithm for training large language models. The core idea is deceptively simple: instead of computing importance ratios at the token level — as GRPO does — compute them at the sequence level.

The argument runs as follows. When a language model generates a response, the reward signal is attached to the entire sequence, not to individual tokens. GRPO nevertheless applies an importance sampling correction at each token position independently. The paper argues this is ill-posed: each weight is based on a single sample from its next-token distribution, so it cannot perform the distribution correction that importance sampling is meant to provide. Instead, the per-token weight introduces high-variance noise that accumulates with sequence length, is amplified by clipping, and becomes catastrophic at the scale needed for large Mixture-of-Experts (MoE) models.

GSPO fixes this by defining the importance ratio at the sequence level. Given a group of $G$ responses $\{y_i\}_{i=1}^G$ sampled from the old policy $\pi_{\theta_{\text{old}}}$ , the sequence-level importance ratio for response $y_i$ is:

$s_i(\theta) = \left(\frac{\pi_\theta(y_i|x)}{\pi_{\theta_{\text{old}}}(y_i|x)}\right)^{\frac{1}{|y_i|}} = \exp\!\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\log\frac{\pi_\theta(y_{i,t}|x,y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t}|x,y_{i,<t})}\right) \tag{Eq. 7}$

This is the geometric mean of per-token probability ratios, equivalently the exponential of the average log-ratio. The length normalization $1/|y_i|$ ensures that the ratio stays in a consistent numerical range regardless of response length, and that responses of different lengths compete on even footing within the clipping mechanism.

The objective then clips this single scalar per response rather than per token:

$\mathcal{J}_{\text{GSPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^G \min\!\left(s_i(\theta)\hat{A}_i,\, \text{clip}(s_i(\theta), 1-\varepsilon, 1+\varepsilon)\hat{A}_i\right)\right] \tag{Eq. 5}$

where the advantage $\hat{A}_i$ is the group-normalized reward:

$\hat{A}_i = \frac{r(x,y_i) - \text{mean}(\{r(x,y_i)\}_{i=1}^G)}{\text{std}(\{r(x,y_i)\}_{i=1}^G)} \tag{Eq. 6}$
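As a concrete illustration of Eq. 6, here is a minimal sketch of group normalization. The function name, the use of the population standard deviation, and the small `eps` guard are my assumptions; the paper does not specify these details.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages (Eq. 6): (r_i - mean) / std.

    Assumptions: population standard deviation, plus a small eps so that a
    group with identical rewards yields zero advantages instead of a
    division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of G = 4 verifier rewards (1 = correct, 0 = incorrect).
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses end up with a positive advantage and incorrect ones with a negative advantage, and the advantages within a group sum to zero. A group where every response gets the same reward carries no learning signal.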

The key effect: in GSPO, every token in a response $y_i$ receives exactly the same gradient weight $s_i(\theta)$ . In GRPO, each token receives a distinct weight $w_{i,t}(\theta) = \pi_\theta(y_{i,t}|x,y_{i,<t})/\pi_{\theta_\text{old}}(y_{i,t}|x,y_{i,<t})$ , which can range over $(0, 1+\varepsilon]$ or $[1-\varepsilon, +\infty)$ depending on the sign of the advantage. These heterogeneous weights accumulate and create instability. GSPO removes this factor by using a single uniform weight per response.

Empirically, GSPO achieves training stability and outperforms GRPO on AIME’24 and LiveCodeBench benchmarks with a cold-start model fine-tuned from Qwen3-30B-A3B-Base. Most importantly, it stabilizes MoE RL training — a problem that has previously required complex ad-hoc stabilization strategies or simply caused model collapse.

My reading of this paper is that the core contribution is a principled correction of the importance sampling mismatch in GRPO, motivated by a simple but overlooked observation: the unit of the optimization objective should match the unit of the reward signal. The paper is mathematically clean and the empirical results are compelling, but the experimental scope is narrow, and some important questions about the algorithm’s theoretical properties and boundary conditions are left unanswered.

1. Prerequisites

1.1 What is Reinforcement Learning for Language Models?

Reinforcement learning for language models (RLHF and its successors) is a training paradigm in which a language model is updated based on reward signals rather than token-prediction loss alone.

The setup is as follows. A language model, treated as a policy $\pi_\theta$ , takes a query $x$ as input and generates a response $y$ . A reward function $r(x, y)$ evaluates the response. The model’s parameters $\theta$ are updated to maximize the expected reward over the distribution of queries and responses:

$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(\cdot|x)}\left[r(x, y)\right]$

In modern LLM post-training, the reward function can be:

A human-preference model (classic RLHF)
A verifier that checks mathematical correctness (e.g., for reasoning tasks)
A code execution result (pass/fail for programming tasks)

The policy generates entire responses — sequences of tokens — and the reward is typically attached to the whole response, not to individual tokens.

1.2 Why Not Just Use Supervised Fine-Tuning?

Supervised fine-tuning (SFT) trains the model to predict human-written tokens. It works well when we have high-quality demonstrations, but it cannot optimize for outcomes that are hard to demonstrate but easy to verify — like correctness of a mathematical proof or whether a code snippet passes test cases. RL fills this gap: the model can explore responses, receive outcome-based feedback, and learn from what works rather than only from what humans wrote.

1.3 Proximal Policy Optimization (PPO)

PPO (Schulman et al., 2017) is the dominant RL algorithm used in RLHF since InstructGPT. The key challenge PPO solves is sample efficiency: generating on-policy rollouts for each gradient step is expensive, so it is desirable to reuse rollout data for multiple gradient updates. But using the same rollout samples for many updates drifts the policy away from the one that generated the data, invalidating the gradient estimates.

PPO handles this through importance sampling combined with a clipping mechanism. The importance sampling correction allows estimating the gradient under the current policy $\pi_\theta$ using samples from an old policy $\pi_{\theta_\text{old}}$ :

$\mathbb{E}_{y \sim \pi_\theta}[f(y)] = \mathbb{E}_{y \sim \pi_{\theta_\text{old}}}\!\left[\frac{\pi_\theta(y)}{\pi_{\theta_\text{old}}(y)} f(y)\right]$

PPO’s token-level objective is:

$\mathcal{J}_\text{PPO}(\theta) = \mathbb{E}\!\left[\frac{1}{|y|}\sum_{t=1}^{|y|}\min\!\left(w_t(\theta)\hat{A}_t,\; \text{clip}(w_t(\theta), 1-\varepsilon, 1+\varepsilon)\hat{A}_t\right)\right] \tag{Eq. 1}$

where the per-token importance ratio is $w_t(\theta) = \pi_\theta(y_t|x, y_{<t}) / \pi_{\theta_\text{old}}(y_t|x, y_{<t})$ , and $\hat{A}_t$ is the advantage estimate from a separate value model.

PPO’s cost: it requires a value model of roughly the same size as the policy model, which doubles memory and compute requirements. For multi-billion-parameter LLMs, this is a severe practical constraint.

1.4 Group Relative Policy Optimization (GRPO)

GRPO (Shao et al., 2024) was introduced to eliminate the need for a value model. The insight: instead of estimating per-token advantages from a value model, compute the relative quality of a response by comparing it to a group of $G$ responses generated from the same query $x$ .

For query $x$ , GRPO generates $G$ responses $\{y_i\}_{i=1}^G \sim \pi_{\theta_\text{old}}(\cdot|x)$ , evaluates each with reward $r(x, y_i)$ , and normalizes within the group:

$\hat{A}_i = \hat{A}_{i,t} = \frac{r(x,y_i) - \text{mean}(\{r(x,y_j)\}_{j=1}^G)}{\text{std}(\{r(x,y_j)\}_{j=1}^G)}$

All tokens in response $y_i$ share this single advantage $\hat{A}_i$ . The GRPO objective is:

$\mathcal{J}_\text{GRPO}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^G \frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\min\!\left(w_{i,t}(\theta)\hat{A}_{i,t},\; \text{clip}(w_{i,t}(\theta), 1-\varepsilon, 1+\varepsilon)\hat{A}_{i,t}\right)\right] \tag{Eq. 2}$

where $w_{i,t}(\theta) = \pi_\theta(y_{i,t}|x, y_{i,<t}) / \pi_{\theta_\text{old}}(y_{i,t}|x, y_{i,<t})$ is the per-token importance ratio.
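The inner sum of Eq. 2 can be sketched for a single response as follows. This computes only the forward value of the surrogate; the function names, the toy probabilities, and the choice of `eps=0.2` are illustrative assumptions.

```python
import math

def clip(value, low, high):
    return max(low, min(high, value))

def grpo_response_objective(new_logprobs, old_logprobs, advantage, eps=0.2):
    """Token-level clipped surrogate for ONE response (inner sum of Eq. 2)."""
    terms = []
    for new_lp, old_lp in zip(new_logprobs, old_logprobs):
        w = math.exp(new_lp - old_lp)  # per-token importance ratio w_{i,t}
        terms.append(min(w * advantage, clip(w, 1 - eps, 1 + eps) * advantage))
    return sum(terms) / len(terms)

# Hypothetical 3-token response with a positive advantage.
old_lp = [math.log(0.5), math.log(0.2), math.log(0.4)]
new_lp = [math.log(0.5), math.log(0.4), math.log(0.3)]  # ratios 1.0, 2.0, 0.75
value = grpo_response_objective(new_lp, old_lp, advantage=1.0)
```

Each token is weighted and clipped on its own: here the second token (ratio 2.0) is capped at 1.2, while the other two pass through unchanged.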

GRPO removes the value model and has been widely adopted (e.g., in DeepSeek-R1), but it exhibits severe instability when training large models on long-response tasks, often culminating in irreversible model collapse.

1.5 Importance Sampling: The Core Principle

Importance sampling estimates the expectation of a function $f$ under a target distribution $\pi_\text{tar}$ using samples drawn from a behavior distribution $\pi_\text{beh}$ :

$\mathbb{E}_{z \sim \pi_\text{tar}}[f(z)] = \mathbb{E}_{z \sim \pi_\text{beh}}\!\left[\frac{\pi_\text{tar}(z)}{\pi_\text{beh}(z)} f(z)\right] \tag{Eq. 4}$

This identity holds when the importance weight is computed over the same probability space as $f$ and the sampling distribution. In GRPO’s case, responses $y_i$ are sampled from $\pi_{\theta_\text{old}}(\cdot|x)$ as sequences, and the reward $r(x, y_i)$ is also a function of the entire sequence. But GRPO applies the importance weight at the level of individual tokens — each token $y_{i,t}$ is treated as if it were a separate sample from an independent next-token distribution. This mismatch between the sampling unit (sequence) and the correction unit (token) is the root cause of instability.
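To make the identity in Eq. 4 concrete, the sketch below enumerates every two-token sequence over a toy vocabulary and checks that reweighting by the sequence-level ratio recovers the expectation under the target policy exactly. The two policies, the reward, and all names are invented for illustration.

```python
from itertools import product

def seq_prob(policy, seq):
    """Probability of a 2-token sequence over the vocabulary {'a', 'b'}."""
    first, second = seq
    p_first_a = policy["first"]            # P(first token == 'a')
    p_second_a = policy["second"][first]   # P(second token == 'a' | first)
    p1 = p_first_a if first == "a" else 1.0 - p_first_a
    p2 = p_second_a if second == "a" else 1.0 - p_second_a
    return p1 * p2

old = {"first": 0.5, "second": {"a": 0.5, "b": 0.5}}  # behavior policy
new = {"first": 0.7, "second": {"a": 0.2, "b": 0.6}}  # target policy

def reward(seq):
    return 1.0 if seq == ("a", "b") else 0.0

sequences = list(product("ab", repeat=2))
expected_under_new = sum(seq_prob(new, s) * reward(s) for s in sequences)
reweighted_under_old = sum(
    seq_prob(old, s) * (seq_prob(new, s) / seq_prob(old, s)) * reward(s)
    for s in sequences
)
```

Both quantities equal 0.56 here. The weight and the reward are defined on the same object, a whole sequence, which is the condition the paragraph above describes.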

1.6 Mini-Batch Training and Off-Policy Drift

When training LLMs with RL, a large batch of rollout responses is generated from the old policy $\pi_{\theta_\text{old}}$ . For efficiency, this batch is split into multiple mini-batches, and multiple gradient steps are performed before generating new rollouts. Each gradient step moves $\theta$ further from $\theta_\text{old}$ , making the samples increasingly off-policy.

The clipping mechanism in PPO/GRPO/GSPO limits how far the policy can drift per update. A sample's gradient contribution is zeroed out when its importance ratio has already moved past the clip boundary in the direction the advantage favors: above $1+\varepsilon$ for a positive advantage, or below $1-\varepsilon$ for a negative one. This prevents very off-policy samples from dominating the gradient. However, if the importance ratio itself is poorly defined (as in GRPO’s token-level ratio), clipping interacts badly with the variance, as we will see.

1.7 Mixture-of-Experts (MoE) Models

MoE models replace dense feed-forward layers with a mixture of $E$ “expert” sub-networks. For each token, a gating mechanism selects a small subset of experts (typically 2 out of, say, 64) to process it. This allows the total parameter count to be large (enabling capacity) while keeping the active compute per token relatively small (enabling efficiency).
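A minimal sketch of top-k gating for a single token is shown below. It follows one common formulation (softmax over the selected experts' logits); real routers differ in details, and the function name and logits are hypothetical.

```python
import math

def top_k_gate(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their softmax weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[e] for e in top)  # subtract max for numerical stability
    exp_scores = {e: math.exp(router_logits[e] - peak) for e in top}
    total = sum(exp_scores.values())
    return {e: score / total for e, score in exp_scores.items()}

# Hypothetical router logits for one token over E = 8 experts.
gates = top_k_gate([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.2], k=2)
```

Only experts 1 and 4 are active for this token. A small change in the logits can swap which experts are selected, which is why routing can shift between the old and the updated policy.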

MoE models are relevant here because Qwen3-30B-A3B-Base, the base model behind GSPO’s experiments, is a sparse MoE model with 30B total parameters but only ~3B active parameters per token. RL training of MoE models is especially prone to instability. The paper attributes this to expert-activation volatility: after a gradient update, the experts routed for the same token can change, so token-level importance ratios fluctuate far more than they would in a dense model.

2. GRPO’s Flaw: A Misapplication of Importance Sampling

2.1 The Token-Level Ratio is Not a Valid Importance Weight

The fundamental issue with GRPO is that it applies importance sampling correction at the wrong granularity.

Recall the principle: importance sampling works when the weight $\pi_\text{tar}(z)/\pi_\text{beh}(z)$ corrects for the distributional mismatch between samples drawn from $\pi_\text{beh}$ and the expectation under $\pi_\text{tar}$ . Critically, this requires the samples $z$ to be drawn from $\pi_\text{beh}$ and the weight to be computed over the same probability space.

In GRPO, the responses $\{y_i\}$ are drawn as whole sequences from $\pi_{\theta_\text{old}}(\cdot|x)$ . The reward $r(x, y_i)$ is a function of the whole sequence. If we wanted a sequence-level importance weight, it would be $\pi_\theta(y_i|x)/\pi_{\theta_\text{old}}(y_i|x)$ .

Instead, GRPO applies weights $w_{i,t} = \pi_\theta(y_{i,t}|x, y_{i,<t})/\pi_{\theta_\text{old}}(y_{i,t}|x, y_{i,<t})$ at each token position $t$ . Each token $y_{i,t}$ is a single sample from $\pi_{\theta_\text{old}}(\cdot|x, y_{i,<t})$ — a distribution over the vocabulary at that context. A single draw from a discrete distribution does not provide a valid basis for importance sampling correction. For the correction to work, you would need to average over many draws from $\pi_{\theta_\text{old}}(\cdot|x, y_{i,<t})$ at each context, which is not what GRPO does.

Figure 1: GRPO vs. GSPO — Unit of Importance Ratio

┌──────────────────────────────────────────────────────────────────────────┐
│  GRPO: Token-Level Importance Ratio                                      │
│                                                                          │
│  Response y_i = [t1,  t2,  t3,  t4, ..., t_T]                          │
│  Reward:                                    r(x, y_i) ← single value    │
│                                                                          │
│  Importance ratios:                                                      │
│      w_{i,1} = pi_new(t1|ctx)/pi_old(t1|ctx)   ← varies per token!    │
│      w_{i,2} = pi_new(t2|ctx)/pi_old(t2|ctx)   ← varies per token!    │
│      ...                                                                 │
│      w_{i,T} = pi_new(tT|ctx)/pi_old(tT|ctx)   ← varies per token!    │
│                                                                          │
│  Objective: sum over tokens with heterogeneous weights → high variance  │
│                                                                          │
├──────────────────────────────────────────────────────────────────────────┤
│  GSPO: Sequence-Level Importance Ratio                                   │
│                                                                          │
│  Response y_i = [t1,  t2,  t3,  t4, ..., t_T]                          │
│  Reward:                                    r(x, y_i) ← single value    │
│                                                                          │
│  Sequence ratio (geometric mean):                                        │
│      s_i = exp(mean_t [log pi_new(t|ctx)/pi_old(t|ctx)])                │
│           = (pi_new(y_i|x)/pi_old(y_i|x))^{1/T}    ← one scalar!     │
│                                                                          │
│  Objective: all tokens weighted equally by s_i → stable, low variance  │
└──────────────────────────────────────────────────────────────────────────┘

2.2 Variance Accumulates With Sequence Length

In GRPO, the gradient involves a sum over $|y_i|$ token positions, each weighted by its own $w_{i,t}$ :

$\nabla_\theta \mathcal{J}_\text{GRPO}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^G \hat{A}_i \cdot \frac{1}{|y_i|}\sum_{t=1}^{|y_i|} w_{i,t}(\theta) \nabla_\theta \log\pi_\theta(y_{i,t}|x, y_{i,<t})\right] \tag{Eq. 12}$

The per-token weights $w_{i,t}$ are not constants — they are random variables that depend on both the old and new policy at each token context. After a gradient step, $\pi_\theta$ has changed, so the weights for subsequent mini-batches are no longer close to 1. For sequences of length 1000+ (common in chain-of-thought reasoning tasks), the accumulated variance from these 1000+ distinct weights can be enormous.

Moreover, the interaction with clipping makes things worse. When $\hat{A}_i > 0$ (response $y_i$ is better than average), we want to increase its probability. The clipping mechanism prevents the update from being too large when $w_{i,t} > 1+\varepsilon$ . But if some tokens in the response have $w_{i,t} > 1+\varepsilon$ (clipped, contributing zero gradient) while others have $w_{i,t}$ near 1 (contributing non-zero gradient), the gradient signal is selectively applied to only a subset of tokens in the response, in a way that is driven by the noise of the token-level ratios rather than by the semantic importance of those tokens. This selective, noisy gradient accumulation is the mechanism through which model collapse emerges.

2.3 Why is Collapse Often Irreversible?

The paper reports an observation: once GRPO causes model collapse, reverting to a previous checkpoint and carefully tuning hyperparameters (clipping ranges, learning rates) does not reliably recover the model. This is characteristic of training dynamics that enter a catastrophic attractor.

One way to build intuition for the sequence of events (this is my interpretation; the paper does not establish the mechanism): noisy token-level weights push some parts of the model toward high-entropy distributions (increasing output diversity), while other parts are pushed toward overconfident token predictions. This creates internal inconsistency in the model’s representations that builds up gradually across many layers, especially in MoE models where different experts handle different tokens. Reverting to an earlier checkpoint restores the weights, but if the same noisy updates are then resumed, training tends to drift back into the same regime, which would explain why retuning hyperparameters after a revert does not reliably help.

3. The GSPO Algorithm

3.1 Core Formulation

GSPO’s key observation: the sequence-level importance weight $\pi_\theta(y|x)/\pi_{\theta_\text{old}}(y|x)$ is a valid importance sampling weight, because it corrects for the distributional mismatch between responses drawn from $\pi_{\theta_\text{old}}(\cdot|x)$ and the expectation under $\pi_\theta(\cdot|x)$ . And since the reward is sequence-level, a sequence-level correction naturally aligns with the reward signal.

Figure 2: GSPO Training Loop Data-Flow

graph TD
    A["Query x ~ D"] --> B["Old Policy pi_old"]
    B --> C["Sample G responses: y1, y2, ..., yG"]
    C --> D["Reward: r(x, y1), ..., r(x, yG)"]
    D --> E["Group Advantage: A_hat_i = normalize(rewards)"]
    C --> F["Per-token log-ratios: log pi_new/pi_old for each token"]
    F --> G["Length-normalize: average over |y_i| tokens"]
    G --> H["Sequence ratio: s_i = exp(avg log-ratio)"]
    H --> I["Clipping: clip(s_i, 1-eps, 1+eps)"]
    I --> J["GSPO objective: min(s_i*A_i, clip*A_i) per response"]
    E --> J
    J --> K["Gradient update -> new theta"]
    K --> |"next mini-batch"| H
    K --> |"next rollout round"| B

The sequence-level importance ratio is:

$s_i(\theta) = \left(\frac{\pi_\theta(y_i|x)}{\pi_{\theta_\text{old}}(y_i|x)}\right)^{1/|y_i|} = \exp\!\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\log\frac{\pi_\theta(y_{i,t}|x,y_{i,<t})}{\pi_{\theta_\text{old}}(y_{i,t}|x,y_{i,<t})}\right) \tag{Eq. 7}$

Derivation step by step:

Start with the full sequence probability:

$\pi_\theta(y_i|x) = \prod_{t=1}^{|y_i|} \pi_\theta(y_{i,t}|x, y_{i,<t})$

Taking the ratio of new to old policy:

$\frac{\pi_\theta(y_i|x)}{\pi_{\theta_\text{old}}(y_i|x)} = \prod_{t=1}^{|y_i|} \frac{\pi_\theta(y_{i,t}|x, y_{i,<t})}{\pi_{\theta_\text{old}}(y_{i,t}|x, y_{i,<t})}$

This product grows or shrinks exponentially with sequence length. A 500-token response where each token has ratio 0.99 gives $0.99^{500} \approx 0.007$ — near zero even for tiny per-token changes. To keep $s_i$ in a usable range and make clipping length-independent, take the geometric mean (equivalent to the $1/|y_i|$ exponent):

$s_i(\theta) = \left(\prod_{t=1}^{|y_i|} \frac{\pi_\theta(y_{i,t}|x, y_{i,<t})}{\pi_{\theta_\text{old}}(y_{i,t}|x, y_{i,<t})}\right)^{1/|y_i|}$

Using $\log$ to convert the product to a sum:

$\log s_i(\theta) = \frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\log\frac{\pi_\theta(y_{i,t}|x, y_{i,<t})}{\pi_{\theta_\text{old}}(y_{i,t}|x, y_{i,<t})}$

This is the arithmetic mean of per-token log-ratios, which is exactly the average log-likelihood ratio. Exponentiating gives Equation 7.
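The derivation maps directly onto a few lines of code. The sketch below assumes per-token log-probabilities are already available as plain lists; the function name and the numbers are hypothetical.

```python
import math

def sequence_ratio(new_logprobs, old_logprobs):
    """Length-normalized sequence-level importance ratio s_i (Eq. 7)."""
    if len(new_logprobs) != len(old_logprobs) or not new_logprobs:
        raise ValueError("need two non-empty log-prob lists of equal length")
    log_ratios = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return math.exp(sum(log_ratios) / len(log_ratios))

# Hypothetical per-token log-probabilities for one 4-token response.
old_lp = [-1.20, -0.50, -2.00, -0.80]
new_lp = [-1.10, -0.55, -1.90, -0.85]
s_i = sequence_ratio(new_lp, old_lp)  # exp(0.025), roughly 1.025
```

Working in log space avoids multiplying hundreds of small probabilities directly, which would underflow for long responses.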

Why does the $1/|y_i|$ normalization matter?

Consider two responses: a 50-token response and a 500-token response. Without length normalization, the sequence ratio for the 500-token response would have 10× more multiplicative factors, pushing it much further from 1. The clipping mechanism would clip the 500-token response at a different effective policy distance than the 50-token response. Length normalization makes clipping length-invariant: responses of any length are clipped at the same “distance” from the old policy in average log-likelihood-ratio units.
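The 50-token versus 500-token comparison can be checked numerically. This sketch assumes, purely for illustration, that every token has the same ratio of 0.99; the helper names are hypothetical.

```python
def raw_sequence_ratio(per_token_ratio, length):
    """Unnormalized sequence ratio when every token shares the same ratio."""
    return per_token_ratio ** length

def normalized_sequence_ratio(per_token_ratio, length):
    """Length-normalized ratio: the geometric mean of the per-token ratios."""
    return raw_sequence_ratio(per_token_ratio, length) ** (1.0 / length)

short_raw = raw_sequence_ratio(0.99, 50)     # about 0.605
long_raw = raw_sequence_ratio(0.99, 500)     # about 0.0066
short_norm = normalized_sequence_ratio(0.99, 50)
long_norm = normalized_sequence_ratio(0.99, 500)
```

The raw ratios differ by two orders of magnitude even though the per-token change is identical, while both normalized ratios come back to 0.99 and therefore see the same clipping boundary.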

3.2 The GSPO Objective

$\mathcal{J}_\text{GSPO}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, \{y_i\} \sim \pi_{\theta_\text{old}}}\!\left[\frac{1}{G}\sum_{i=1}^G \min\!\left(s_i(\theta)\hat{A}_i,\; \text{clip}(s_i(\theta), 1-\varepsilon, 1+\varepsilon)\hat{A}_i\right)\right] \tag{Eq. 5}$

The min/clip structure is borrowed from PPO. It behaves differently depending on the sign of the advantage:

Positive advantage: When $\hat{A}_i > 0$ (response $y_i$ is better than average), the objective is $\min(s_i \hat{A}_i,\; (1+\varepsilon)\hat{A}_i)$ . This is $s_i\hat{A}_i$ when $s_i \leq 1+\varepsilon$ , and the constant $(1+\varepsilon)\hat{A}_i$ when $s_i > 1+\varepsilon$ . The gain is capped: once the response's ratio has risen past $1+\varepsilon$ , it contributes no further gradient.
Negative advantage: When $\hat{A}_i < 0$ (response $y_i$ is worse than average), the objective is $\min(s_i \hat{A}_i,\; (1-\varepsilon)\hat{A}_i)$ . This is $s_i\hat{A}_i$ when $s_i \geq 1-\varepsilon$ , and the constant $(1-\varepsilon)\hat{A}_i$ when $s_i < 1-\varepsilon$ . Once the ratio has dropped below $1-\varepsilon$ , the response is not pushed down any further.

The effect: a response whose ratio has already moved past the clip boundary in the direction its advantage favors is excluded from gradient estimation, preventing runaway updates. A response that has moved the other way (for example $s_i < 1-\varepsilon$ with $\hat{A}_i > 0$ ) still contributes gradient.
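The case analysis above can be written as a small helper that returns both the objective value and whether the response still carries gradient. The function names are hypothetical, and `eps = 0.2` in the usage lines is only an illustrative value (the paper notes that GSPO's clipping range is on a different scale from GRPO's).

```python
def clip(value, low, high):
    return max(low, min(high, value))

def gspo_term(s_i, advantage, eps):
    """Return (objective value, has_gradient) for one response."""
    unclipped = s_i * advantage
    clipped = clip(s_i, 1 - eps, 1 + eps) * advantage
    value = min(unclipped, clipped)
    # Gradient flows only when the unclipped branch is the active one;
    # the clipped branch is a constant with respect to theta.
    has_gradient = unclipped <= clipped
    return value, has_gradient

capped = gspo_term(1.5, advantage=+1.0, eps=0.2)      # ratio too high, A > 0
still_live = gspo_term(0.5, advantage=+1.0, eps=0.2)  # ratio low, A > 0
floored = gspo_term(0.5, advantage=-1.0, eps=0.2)     # ratio too low, A < 0
```

Note that `still_live` keeps its gradient even though its ratio is far outside the clip interval: clipping is one-sided for each sign of the advantage.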

Figure 3: Clipping Mechanics — GRPO Token-Level vs. GSPO Sequence-Level

GRPO (per token, advantage = A_hat_i for all tokens in y_i; here A_hat_i > 0, eps = 0.2):

  Token t1:  w_{i,1} = 0.8   → not above 1.2 → contributes to gradient
  Token t2:  w_{i,2} = 1.5   → above 1.2 → CLIPPED → zero gradient contribution
  Token t3:  w_{i,3} = 0.3   → not above 1.2 → contributes to gradient (scaled by 0.3)
  Token t4:  w_{i,4} = 1.1   → not above 1.2 → contributes to gradient
  ...

  (For A_hat_i < 0 the test flips: tokens with w below 0.8 are clipped instead.)

  Result: only some tokens contribute gradient, selected and scaled by noisy per-token ratios.
          Pattern is unpredictable and varies across mini-batches.

GSPO (per sequence, single s_i):

  s_i = exp(mean log-ratio over all tokens) = 0.95   → not clipped

  All tokens t1, t2, t3, ..., t_T weighted equally by 0.95.
  Either ALL tokens contribute (s_i not clipped) or NONE (s_i clipped).

  Result: coherent, all-or-nothing update per response.
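The contrast can be reproduced with the four illustrative token ratios from the figure. This is a sketch under the PPO-style min/clip rule, where a positive advantage removes the gradient only for ratios above `1 + eps` and a negative advantage only for ratios below `1 - eps`; the function names are hypothetical.

```python
import math

def token_gradient_mask(ratios, advantage, eps):
    """True where a token still contributes gradient under token-level clipping."""
    if advantage >= 0:
        return [w <= 1 + eps for w in ratios]
    return [w >= 1 - eps for w in ratios]

def sequence_ratio(ratios):
    """Geometric mean of the per-token ratios (the GSPO ratio s_i)."""
    return math.exp(sum(math.log(w) for w in ratios) / len(ratios))

token_ratios = [0.8, 1.5, 0.3, 1.1]  # the four illustrative tokens
mask_pos = token_gradient_mask(token_ratios, advantage=+1.0, eps=0.2)
s_i = sequence_ratio(token_ratios)   # about 0.79 for these four tokens alone
```

With a positive advantage, token-level clipping drops only the second token, while the other three contribute with weights ranging from 0.3 to 1.1. The sequence-level view collapses the same four ratios into one scalar that is applied uniformly to every token.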

3.3 Gradient Analysis: Why GSPO Is More Stable

The gradient of the GSPO objective (omitting clipping for clarity) is:

$\nabla_\theta \mathcal{J}_\text{GSPO}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^G \left(\frac{\pi_\theta(y_i|x)}{\pi_{\theta_\text{old}}(y_i|x)}\right)^{1/|y_i|} \hat{A}_i \cdot \frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\nabla_\theta\log\pi_\theta(y_{i,t}|x,y_{i,<t})\right] \tag{Eq. 10}$

Compare with GRPO’s gradient:

$\nabla_\theta \mathcal{J}_\text{GRPO}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^G \hat{A}_i \cdot \frac{1}{|y_i|}\sum_{t=1}^{|y_i|} \frac{\pi_\theta(y_{i,t}|x,y_{i,<t})}{\pi_{\theta_\text{old}}(y_{i,t}|x,y_{i,<t})} \nabla_\theta\log\pi_\theta(y_{i,t}|x,y_{i,<t})\right] \tag{Eq. 12}$

The structural difference is critical:

GRPO: each token’s gradient contribution is scaled by its own $w_{i,t}$ . Since $w_{i,t}$ varies across tokens, the gradient sums contributions with heterogeneous weights, so noise in the individual ratios enters the gradient token by token, and a longer response offers more positions at which an extreme ratio can occur.
GSPO: all tokens’ gradient contributions are scaled by the same $s_i$ . The weighting contributes a single random scalar per response, and because $s_i$ is length-normalized, its spread does not grow with sequence length. This is the key variance reduction.
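The uniform weight in Eq. 10 follows from one application of the chain rule to Eq. 7. Since $\pi_{\theta_\text{old}}$ does not depend on $\theta$ , differentiating the exponential of the average log-ratio gives (clipping omitted, as above):

$\nabla_\theta s_i(\theta) = s_i(\theta)\cdot\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\nabla_\theta\log\pi_\theta(y_{i,t}|x,y_{i,<t})$

The same step applied to a single token ratio gives $\nabla_\theta w_{i,t}(\theta) = w_{i,t}(\theta)\,\nabla_\theta\log\pi_\theta(y_{i,t}|x,y_{i,<t})$ , which is where the per-token weights in Eq. 12 come from.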

Figure 4: Gradient Variance Scaling with Sequence Length

Gradient variance in GRPO (heuristic):
  Each token gradient grad_t is multiplied by its own random weight w_{i,t},
  so the noise in every w_{i,t} enters the gradient separately.
  
  As |y| increases, noise from token-level ratios can accumulate.
  For long reasoning chains (|y| ~ 2000 tokens), the paper argues this becomes catastrophic.

Gradient variance in GSPO (heuristic):
  The whole response gradient (1/|y|) * sum_t grad_t is multiplied by one scalar s_i,
  so the weighting adds only the noise of that single scalar.
  
  s_i is a single scalar per response. Its variance does NOT grow with |y|.
  In fact, s_i is the exponential of the arithmetic mean of |y| log-ratios,
  so if those log-ratios are only weakly dependent, the mean concentrates
  as |y| grows → lower variance.
  
  Counterintuitively, GSPO may become MORE stable for longer sequences.
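The concentration argument can be illustrated with a toy simulation. The key assumption, which real training does not satisfy exactly, is that per-token log-ratios are i.i.d. Gaussian noise; the function name, noise scale, and seed are arbitrary choices of mine.

```python
import math
import random
from statistics import pstdev

def simulate_sequence_ratios(length, sigma=0.1, trials=500, seed=0):
    """Sample s_i = exp(mean of per-token log-ratios) under a toy noise model.

    Assumption: per-token log-ratios are i.i.d. Gaussian(0, sigma). Real
    log-ratios are correlated, so this only illustrates the averaging effect.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(trials):
        mean_log_ratio = sum(rng.gauss(0.0, sigma) for _ in range(length)) / length
        ratios.append(math.exp(mean_log_ratio))
    return ratios

spread_short = pstdev(simulate_sequence_ratios(length=10))
spread_long = pstdev(simulate_sequence_ratios(length=1000))
```

Under this noise model the spread of $s_i$ shrinks roughly as one over the square root of the length, so the 1000-token case is about ten times tighter than the 10-token case. A single token ratio, by contrast, keeps a spread of about `sigma` no matter how long the response is.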

3.4 GSPO-token: A Flexible Token-Level Variant

For use cases like multi-turn RL, where different turns or different token spans should receive different advantage values, GSPO introduces a variant called GSPO-token:

$\mathcal{J}_\text{GSPO-token}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^G \frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\min\!\left(s_{i,t}(\theta)\hat{A}_{i,t},\; \text{clip}(s_{i,t}(\theta), 1-\varepsilon, 1+\varepsilon)\hat{A}_{i,t}\right)\right] \tag{Eq. 13}$

where $s_{i,t}(\theta)$ uses a stop-gradient trick to separate the scalar weight from the gradient flow:

$s_{i,t}(\theta) = \text{sg}[s_i(\theta)] \cdot \frac{\pi_\theta(y_{i,t}|x,y_{i,<t})}{\text{sg}[\pi_\theta(y_{i,t}|x,y_{i,<t})]} \tag{Eq. 14}$

Here $\text{sg}[\cdot]$ is the stop-gradient operator (PyTorch detach). The term $\pi_\theta(y_{i,t})/\text{sg}[\pi_\theta(y_{i,t})]$ numerically equals 1 but carries gradient through $\pi_\theta$ . Combined with $\text{sg}[s_i(\theta)]$ , the overall scaling is numerically $s_i(\theta)$ per token.

Why is this needed? When $\hat{A}_{i,t} = \hat{A}_i$ for all $t$ (as in the standard sequence-level reward case), GSPO-token is numerically identical to GSPO (same objective, same gradient, same clipping condition). But when $\hat{A}_{i,t}$ varies per token — e.g., different segments of a multi-turn conversation have different rewards — GSPO-token enables the finer-grained advantage signal while still using the stable sequence-level ratio $s_i$ as the uniform weight.
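To see why the stop-gradient construction works, differentiate one term of Eq. 13 (clipping omitted). The $\text{sg}[\cdot]$ factors are treated as constants, so only the numerator $\pi_\theta$ carries gradient:

$\nabla_\theta\!\left[s_{i,t}(\theta)\hat{A}_{i,t}\right] = \text{sg}[s_i(\theta)]\,\hat{A}_{i,t}\,\frac{\nabla_\theta\pi_\theta(y_{i,t}|x,y_{i,<t})}{\text{sg}[\pi_\theta(y_{i,t}|x,y_{i,<t})]} = s_i(\theta)\,\hat{A}_{i,t}\,\nabla_\theta\log\pi_\theta(y_{i,t}|x,y_{i,<t})$

Averaging over $t$ with $\hat{A}_{i,t} = \hat{A}_i$ gives $s_i(\theta)\,\hat{A}_i\cdot\frac{1}{|y_i|}\sum_t \nabla_\theta\log\pi_\theta(y_{i,t}|x,y_{i,<t})$ , which matches the per-response term in Eq. 10.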

Pseudocode for GSPO:

Algorithm 1: GSPO Training Step
Input: current policy pi_theta, query distribution D, group size G,
       clipping range eps, reward function r

1. Sample batch of queries {x_j} from D
2. For each query x:
   a. Generate G responses {y_1, ..., y_G} from pi_{theta_old}(.|x)
   b. Compute rewards {r_1, ..., r_G} = r(x, y_1), ..., r(x, y_G)
   c. Compute group advantage for each response i:
      A_hat_i = (r_i - mean({r_j})) / std({r_j})
3. Split responses into mini-batches for gradient updates
4. For each mini-batch:
   a. For each response y_i:
      - Compute per-token log-ratios:
        log_ratio_t = log pi_theta(y_{i,t}|x,y_{i,<t}) - log pi_{theta_old}(y_{i,t}|x,y_{i,<t})
      - Compute sequence ratio:
        s_i = exp( (1/|y_i|) * sum_t log_ratio_t )
      - Compute clipped objective:
        L_i = min(s_i * A_hat_i, clip(s_i, 1-eps, 1+eps) * A_hat_i)
   b. Gradient step: theta = theta + alpha * grad_theta mean_i(L_i)
5. Set pi_{theta_old} <- pi_theta once all mini-batches are used, before sampling the next rollout batch
Output: updated policy pi_theta
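Steps 2c and 4a of the pseudocode can be sketched in plain Python for one group of responses. This computes only the forward value of the objective; in practice the same expression would be written in an autodiff framework so that gradients flow through the new log-probabilities. The function name, the `1e-8` guard, and the toy numbers are assumptions.

```python
import math
from statistics import mean, pstdev

def gspo_objective(new_logprobs, old_logprobs, rewards, eps):
    """Clipped GSPO objective for one group of G responses (forward value only).

    new_logprobs / old_logprobs: one list of per-token log-probs per response.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    terms = []
    for new_lp, old_lp, r in zip(new_logprobs, old_logprobs, rewards):
        advantage = (r - mu) / (sigma + 1e-8)   # step 2c (guard is an assumption)
        mean_log_ratio = sum(n - o for n, o in zip(new_lp, old_lp)) / len(new_lp)
        s_i = math.exp(mean_log_ratio)          # step 4a: sequence ratio
        clipped = max(1 - eps, min(1 + eps, s_i))
        terms.append(min(s_i * advantage, clipped * advantage))
    return sum(terms) / len(terms)

# Hypothetical group: G = 2 responses of different lengths.
old = [[-1.0, -1.0], [-0.5, -0.5, -0.5]]
new = [[-0.9, -0.9], [-0.6, -0.6, -0.6]]
value = gspo_objective(new, old, rewards=[1.0, 0.0], eps=0.2)
```

In this toy group the policy has already raised the likelihood of the rewarded response and lowered that of the unrewarded one, so the objective is positive. When the new and old log-probabilities coincide, every ratio is 1 and the objective is zero.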

4. Experiments

4.1 Experimental Setup

The paper evaluates GSPO using a cold-start model fine-tuned from Qwen3-30B-A3B-Base. The base model is a sparse MoE model with 30B total parameters and approximately 3B active parameters per token.

Benchmarks:

AIME’24: The 2024 American Invitational Mathematics Examination, a competition-level mathematics benchmark. Metric: average Pass@1 over 32 samplings.
LiveCodeBench (202410-202502): A code generation benchmark covering problems from competitive programming contests during October 2024 to February 2025. Metric: average Pass@1 over 8 samplings.
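For readers unfamiliar with the metric, "average Pass@1 over k samplings" is commonly computed as the per-problem fraction of correct samples, averaged over problems. The sketch below follows that reading, which is my assumption about the paper's exact procedure; the function name and outcomes are hypothetical.

```python
def average_pass_at_1(results_per_problem):
    """Mean over problems of the fraction of sampled answers that are correct."""
    per_problem = [sum(samples) / len(samples) for samples in results_per_problem]
    return sum(per_problem) / len(per_problem)

# Hypothetical outcomes: 2 problems, 4 samplings each (True = correct).
score = average_pass_at_1([
    [True, False, True, True],
    [False, False, True, False],
])
```

Averaging over many samplings reduces the noise of a single-attempt score, which matters on a small benchmark such as AIME.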

The experimental protocol compares GSPO to GRPO starting from the same checkpoint, with both algorithms training on the same queries and using the same reward function.

4.2 Training Stability

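The headline result is that GSPO trains stably throughout on the Qwen3-30B-A3B-Base MoE model, while GRPO is prone to catastrophic instability on it; the paper reports that its GRPO baseline needs an additional routing-stabilization strategy (Routing Replay) to converge normally.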

Figure 5: Training Stability Schematic (GRPO Collapse vs. GSPO Stability)

GRPO training curve (training reward):
  ───────────────────────────────────────
  Training reward / model performance

  0.60 │     /^^^^^\
  0.50 │    /       \       ← reward peak
  0.40 │   /         \
  0.30 │──/           \                  ← collapse begins
  0.20 │               \___...↓
  0.10 │                              ↓
  0.00 │_______________________________________ steps
       
  Pattern: initial improvement followed by sudden collapse.
  Often irreversible; reverting checkpoint does not fix.

GSPO training curve (training reward):
  ───────────────────────────────────────
  Training reward / model performance

  0.60 │                          /─────
  0.50 │               /─────────/
  0.40 │      /───────/
  0.30 │─────/
  0.20 │
  0.10 │
  0.00 │_______________________________________ steps
  
  Pattern: monotone improvement, stable, no collapse.

The stability difference is especially pronounced in MoE training. Dense model RL training with GRPO can sometimes be stabilized with careful hyperparameter tuning, but MoE models are far more sensitive. GSPO stabilizes MoE RL training inherently, without requiring additional tricks.

4.3 Performance Results

On both AIME’24 and LiveCodeBench, the GSPO-trained model outperforms the GRPO-trained model. The performance advantage has two components:

Stability component: GSPO avoids collapse, so it can train for longer, accumulating more improvement.
Algorithmic component: Even at the same number of training steps (before GRPO collapses), GSPO shows superior sample efficiency — the model improves faster per gradient step.

The paper states that GSPO contributed to the improvements in the latest Qwen3 models, which suggests that the results carry over to production-scale systems.

4.4 Infrastructure Implications

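The paper notes that GSPO “has the potential for simplifying the design of RL infrastructure.” Its argument concerns numerical precision: training engines and inference engines compute slightly different likelihoods for the same tokens, so the old-policy likelihoods are typically recomputed with the training engine. Because GSPO optimizes with sequence-level rather than token-level likelihoods, it should be more tolerant of such discrepancies, which may make it possible to use the likelihoods returned by the inference engine directly and skip the recomputation. Like GRPO, GSPO needs no value model, and per-response clipping also simplifies bookkeeping compared to per-token clipping.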

5. Limitations and Boundary Conditions

5.1 Length Normalization Removes Length Information

The geometric mean formulation $s_i(\theta) = \left(\pi_\theta(y_i|x)/\pi_{\theta_\text{old}}(y_i|x)\right)^{1/|y_i|}$ normalizes out the sequence length. This is good for variance stability, but it means that a 50-token response and a 500-token response with the same per-token average log-ratio are treated identically by the clipping mechanism. In situations where response length itself is a quality signal (e.g., concise answers should be preferred over verbose ones), this normalization may not be appropriate without additional reward shaping.

5.2 Group Size G and Statistical Reliability

The group-based advantage $\hat{A}_i$ is estimated from $G$ responses. When $G$ is small (say, $G = 4$ ), the sample mean and standard deviation of rewards within a group are noisy estimates of the true relative quality. GSPO inherits this limitation from GRPO. The paper does not discuss how large $G$ needs to be for the advantage estimates to be reliable, or how $G$ interacts with model size and task difficulty.

5.3 MoE vs. Dense Models

The reported stability improvements are demonstrated on a MoE model (Qwen3-30B-A3B-Base). Dense model behavior may differ. For dense models, GRPO may not collapse as catastrophically because the gradient variance from token-level ratios does not interact with the sparse activation pattern of MoE routing. The generality of GSPO’s stability benefits to all architectures is not fully established.

5.4 Clipping Range Sensitivity

The paper notes that GSPO and GRPO use clipping ranges $\varepsilon$ that “differ in order of magnitude” due to the different definitions of the importance ratio. The sequence-level ratio $s_i$ is in a different numerical range than the token-level ratio $w_{i,t}$ , requiring a different $\varepsilon$ . The paper does not provide guidelines for setting $\varepsilon$ in GSPO beyond the values used in its own experiments, nor does it ablate over different values.

6. Critical Assessment: Weaknesses & Improvements

6.1 Weaknesses and Flaws

(a) Narrow Baseline Comparison

GSPO is only compared against GRPO. The paper discusses PPO extensively in the preliminaries (Section 2), presenting it as the gold standard of RL algorithms, but never actually runs a PPO comparison in the experiments. This is a significant gap: if the reader’s takeaway is “GSPO > GRPO,” the natural follow-up is “compared to PPO with a value model, how does GSPO stand?” Given that value-model-free methods such as GRPO and GSPO are motivated partly by the cost of PPO’s value model at scale, it is surprising that no direct PPO comparison appears.

(b) Only One Model Architecture Tested

All experimental results come from a single model: Qwen3-30B-A3B-Base. This is a sparse MoE model. The paper never tests GSPO on a dense model (e.g., a dense Qwen2 or LLaMA-style architecture), on a smaller model (where collapse dynamics may differ), or on a non-Qwen architecture. The claim that GSPO “inherently resolves the stability challenges” of MoE RL training cannot be generalized from a single model and a single organization’s infrastructure.

(c) No Ablation Studies

The paper presents GSPO as a combination of design decisions: sequence-level ratio, length normalization, group-based advantage. None of these components are ablated:

What if you use the raw sequence ratio $\pi_\theta(y_i|x)/\pi_{\theta_\text{old}}(y_i|x)$ without length normalization?
What if you use the sequence ratio but apply it token-wise (i.e., scale each token’s gradient by the same constant $s_i$ but clip each token separately)? How does this compare to GSPO’s response-level clipping?
What if you keep GRPO’s token-level ratio but clip at the response level?

Without these ablations, it is impossible to identify which component of GSPO is driving the stability improvement.

(d) Compute and Memory Costs Not Reported

The paper claims GSPO “has the potential for simplifying RL infrastructure” and implies that removing the value model reduces costs. But it does not report actual compute costs, memory usage, throughput (tokens per second), or wall-clock training time for either GRPO or GSPO. The “infrastructure simplification” argument is therefore qualitative and unverifiable from the paper alone.

(e) The “MoE Stabilization” Attribution Is Confounded

The paper states that GSPO “inherently resolved the stability challenges in the RL training of large Mixture-of-Experts (MoE) models.” But no ablation separates GSPO’s algorithmic design from the specific training setup used for Qwen3 (data curriculum, reward model, learning rate schedule, rollout strategy). The stability could be partly or entirely due to engineering decisions outside the GSPO objective. Without a controlled comparison — same model, same training setup, only the importance ratio definition changed — the attribution is speculative.

(f) No Theoretical Convergence Analysis

PPO inherits its theoretical motivation from the monotonic improvement theorem behind TRPO (Schulman et al., 2015), although PPO’s clipped objective is itself a heuristic approximation of that trust region. GRPO’s convergence properties have been studied in subsequent work. GSPO provides gradient analysis (Sections 4.2–4.3) showing the structural difference in gradient weighting, but no convergence guarantee, no rate analysis, and no formal stability proof. The paper’s theoretical contribution is limited to showing that sequence-level importance sampling is more principled than token-level, which is an important observation but falls short of a convergence theorem.

(g) Reward Range Assumption

The paper defines the reward as $r(x, y) \in [0, 1]$ (Section 2). This is appropriate for verifier-based rewards (math, code). It may not apply to RL with learned reward models (e.g., preference models), which can output unbounded scores. The interaction of GSPO’s clipping with unbounded or multimodal reward distributions is not discussed.

6.2 Limitations the Authors Understate or Omit

(a) Length Normalization May Harm Length-Critical Tasks

Section 4.1 of the paper briefly notes that length normalization is used “to reduce the variance and to control $s_i$ within a unified numerical range.” The authors do not acknowledge a potential downside: for tasks where response length is itself informative (e.g., math problems where longer reasoning chains tend to be more correct, or conversational tasks where verbosity is penalized), normalizing out length removes a signal that the importance ratio could otherwise carry. No experiment tests whether GSPO’s length normalization hurts performance on length-sensitive tasks.

(b) Small Group Size Variance Is Not Quantified

The group size $G$ determines how reliable the advantage estimates are. For $G=4$ , the sample standard deviation from 4 rewards is a noisy estimator. The paper does not discuss the minimum recommended group size, the effect of reward variance (e.g., tasks where most responses get reward 0 or 1) on the advantage estimates, or whether the advantage normalization breaks down when all $G$ responses receive the same reward (standard deviation is 0, causing numerical issues).

(c) Infrastructure Simplification Is Conditional

The claim about “simplifying RL infrastructure” assumes that GSPO can operate without a value model (unlike PPO). This is true — GSPO, like GRPO, is value-model-free. But production-scale RLHF often requires a value model not just for advantage estimation, but also for early stopping, debugging, and monitoring training health. Removing the value model may simplify some aspects while complicating others.

(d) Clipping Range Setting Is Unguided

The paper notes that GSPO’s clipping range $\varepsilon$ “typically differs in order of magnitude” from GRPO’s. If a practitioner is switching from GRPO to GSPO, they need to re-tune $\varepsilon$ . The paper offers no empirical guidance — no sweep over values, no rule of thumb for setting $\varepsilon$ based on model size, task, or typical token probability ratios.

6.3 Concrete Improvement Suggestions

(a) Add PPO as a Baseline

Run PPO (with a value model initialized from the policy) against GSPO on the same benchmark, same model, same compute budget. This would directly answer whether GSPO achieves PPO-quality results without the value model cost. Without this comparison, the paper’s practical significance is unclear.

(b) Ablation Over Each Design Choice

Run four variants on the same model and task:

GRPO (baseline: token-level ratio, token-level clipping)
GRPO with response-level clipping (token-level ratios kept as gradient weights, with a single clip decision per response based on their average)
GSPO without length normalization (raw sequence ratio, no $1/|y_i|$ exponent, response-level clipping)
Full GSPO (length-normalized sequence ratio, response-level clipping)

This is not a full factorial design, but adjacent pairs differ in one factor each: the first two isolate response-level clipping, the middle two isolate the ratio definition, and the last two isolate length normalization. Together they would identify which component drives stability.

(c) Test on Dense Models and Non-Qwen Architectures

Demonstrate GSPO’s stability advantage on a dense model (e.g., LLaMA-3-8B) and a non-Alibaba architecture. This would establish whether the stability benefit is architecture-agnostic.

(d) Report Compute and Memory Benchmarks

Add a table comparing GRPO and GSPO in terms of GPU memory usage, tokens-per-second throughput, and wall-clock training time per step. This would quantify the “infrastructure simplification” claim.

(e) Analyze Token-Level Gradient Distribution

For a training run under each algorithm, log the distribution of $w_{i,t}$ (GRPO) and $s_i$ (GSPO) per step. Plotting histograms or variance over training steps would empirically confirm the variance reduction hypothesis and would help practitioners understand how the ratio distribution evolves during training.
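The logging suggested here can be sketched in a few lines. The following is a minimal illustration, not the paper's tooling: the function name `ratio_stats` and the sample log-ratios are hypothetical, and it uses only the Python standard library so that it runs without a training stack.

```python
import math
import statistics

def ratio_stats(token_log_ratios_per_response):
    """Summarize the spread of token-level ratios w_{i,t} and sequence-level ratios s_i."""
    # Token-level ratios: one value per token, pooled over all responses.
    token_ratios = [math.exp(lr) for resp in token_log_ratios_per_response for lr in resp]
    # Sequence-level ratios: exp of the mean log-ratio, one value per response.
    seq_ratios = [math.exp(sum(resp) / len(resp)) for resp in token_log_ratios_per_response]
    return {
        "token_ratio_std": statistics.pstdev(token_ratios),
        "seq_ratio_std": statistics.pstdev(seq_ratios),
        "token_ratio_max_dev": max(abs(w - 1.0) for w in token_ratios),
        "seq_ratio_max_dev": max(abs(s - 1.0) for s in seq_ratios),
    }

# Hypothetical per-token log-ratios for two short responses (illustration only).
batch = [
    [0.10, -0.10, 0.05, -0.05],
    [0.20, -0.15, 0.00, -0.03],
]
stats = ratio_stats(batch)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

In this toy batch the token-level ratios deviate from 1 by up to about 0.22, while the sequence-level ratios stay within about 0.005, because positive and negative log-ratios average out within each response. Whether real training runs show the same gap is exactly what the suggested logging would test.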

(f) Provide $\varepsilon$ Setting Guidelines

Based on typical per-token log-probability ratios observed in practice, derive a recommended initial $\varepsilon$ for GSPO and provide a tuning strategy. For example: run a short diagnostic rollout, measure the distribution of $s_i$ values, and set $\varepsilon$ so that the initial clip rate is approximately X% of responses.

7. Conclusion

GSPO is a principled correction to GRPO’s misapplication of importance sampling. The key insight — that the unit of the optimization objective should match the unit of the reward signal — is simple but has a large practical impact: it eliminates the token-level gradient variance that causes instability in GRPO, especially for large MoE models.

The algorithm is elegant: a single geometric-mean importance ratio per response replaces $|y_i|$ per-token ratios, and response-level clipping replaces token-level clipping. The gradient analysis shows that this makes every token’s update weight uniform within a response, removing a primary source of training noise.

The empirical results on Qwen3 are compelling. GSPO stabilizes MoE RL training where GRPO collapses, and delivers superior performance on AIME’24 and LiveCodeBench. The real-world validation — GSPO’s deployment in Qwen3 production training — provides strong evidence of practical utility.

However, the paper’s experimental scope is narrow: one model, two benchmarks, no PPO comparison, no ablations. The theoretical analysis is insightful but not a convergence guarantee. Practitioners who want to adopt GSPO should expect to re-tune the clipping range $\varepsilon$ , should verify stability on their specific architecture, and should treat the “infrastructure simplification” claim as a qualitative direction rather than a quantified result.

The core principle — sequence-level importance ratio matched to sequence-level reward — is likely to influence the design of future RL algorithms for LLMs. Whether GSPO itself becomes the dominant algorithm, or whether it serves as an intermediate step toward better-understood methods, it represents a meaningful advance in the theoretical foundations of LLM reinforcement learning.

References

Schulman, J. et al. (2015). Trust Region Policy Optimization.
Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms.
Shao, Z. et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. (GRPO paper)
OpenAI (2024). OpenAI o1 Technical Report.
DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
Qwen Team (2025a). Qwen3 Technical Report.
Qwen Team (2025b). QwQ: Reflect Deeply on the Boundaries of the Unknown.
Zheng, C. et al. (2023). (Sequence-level importance ratio prior work, referenced in GSPO Section 4.1)
MiniMax (2025). MiniMax-Text-01: Scaling Frontier Language Models with Lightning Attention.

Appendix A: Detailed Comparison Table — PPO, GRPO, GSPO

Property	PPO	GRPO	GSPO
Importance ratio unit	Per token $w_t$	Per token $w_{i,t}$	Per sequence $s_i$ (geo-mean)
Advantage estimation	Value model	Group reward normalization	Group reward normalization
Value model required	Yes	No	No
Clip unit	Per token	Per token	Per sequence
Reward signal unit	Per token (advantage)	Per sequence (reward)	Per sequence (reward)
Gradient weight per token	Varies ( $w_t$ )	Varies ( $w_{i,t}$ )	Uniform ( $s_i$ )
Stability at large scale	Good (with value model)	Unstable for large MoE	Stable
Infrastructure complexity	High (two models)	Low	Low
Convergence theory	Yes (TRPO basis)	Partial	Not yet
Multi-turn support	Yes	With hacking	Yes (GSPO-token)
Length sensitivity	Implicit (via advantage)	Length-agnostic (per-token avg)	Length-normalized (explicit)

The table makes clear that GSPO occupies a middle ground: it matches GRPO’s infrastructure simplicity (no value model) while addressing GRPO’s core instability. What it does NOT have, compared to PPO, is a theoretical convergence guarantee and a value model signal for advantage estimation on complex tasks.

Appendix B: Why Geometric Mean? Alternatives and Their Problems

The choice of geometric mean for the sequence-level importance ratio deserves further scrutiny. Several alternatives could be considered:

B.1 Raw Sequence Ratio (No Normalization)

$r_i^{\text{raw}}(\theta) = \frac{\pi_\theta(y_i|x)}{\pi_{\theta_\text{old}}(y_i|x)} = \prod_{t=1}^{|y_i|} \frac{\pi_\theta(y_{i,t}|x,y_{i,<t})}{\pi_{\theta_\text{old}}(y_{i,t}|x,y_{i,<t})}$

Problem: Grows or shrinks exponentially with $|y_i|$ . For $|y_i| = 1000$ tokens and a geometric-mean per-token ratio of $1.001$ , the sequence ratio is $1.001^{1000} \approx e^{0.001 \times 1000} = e \approx 2.7$ . For a per-token ratio of $0.999$ , it is approximately $e^{-1} \approx 0.37$ . Responses of different lengths would be in completely different numerical ranges, so the clipping mechanism could not use a single $\varepsilon$ for all responses.
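The length dependence is easy to check numerically. The sketch below assumes, purely for illustration, that every token has the same ratio of 1.001; the function names are hypothetical.

```python
import math

def raw_sequence_ratio(token_log_ratios):
    """Product of per-token ratios, computed as exp(sum of log-ratios)."""
    return math.exp(sum(token_log_ratios))

def normalized_sequence_ratio(token_log_ratios):
    """Geometric mean of per-token ratios: exp(mean of log-ratios)."""
    return math.exp(sum(token_log_ratios) / len(token_log_ratios))

# Assumption for illustration: every token has the same ratio 1.001.
per_token_log_ratio = math.log(1.001)

for length in (50, 500, 1000):
    log_ratios = [per_token_log_ratio] * length
    raw = raw_sequence_ratio(log_ratios)
    norm = normalized_sequence_ratio(log_ratios)
    print(f"|y|={length:5d}  raw={raw:.4f}  normalized={norm:.4f}")
```

The raw ratio grows from roughly 1.05 at 50 tokens to roughly 2.72 at 1000 tokens, while the length-normalized ratio stays at 1.001 for every length, which is why a single clipping range can only work with the normalized form.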

B.2 Arithmetic Mean of Ratios

$r_i^{\text{arith}}(\theta) = \frac{1}{|y_i|}\sum_{t=1}^{|y_i|} \frac{\pi_\theta(y_{i,t}|x,y_{i,<t})}{\pi_{\theta_\text{old}}(y_{i,t}|x,y_{i,<t})}$

Problem: The arithmetic mean of per-token ratios does not recover the sequence likelihood ratio, and large outlier tokens (where $w_{i,t}$ is far above 1) can disproportionately dominate. The arithmetic mean is also always $\geq$ the geometric mean (by the AM-GM inequality), so it tends to overestimate the policy shift for typical training trajectories.

B.3 Geometric Mean (GSPO’s Choice)

$s_i(\theta) = \left(\prod_{t=1}^{|y_i|} \frac{\pi_\theta(y_{i,t}|x,y_{i,<t})}{\pi_{\theta_\text{old}}(y_{i,t}|x,y_{i,<t})}\right)^{1/|y_i|}$

Advantages:

Connects directly to sequence log-likelihood (via average log-ratio)
Length-normalized: the same $\varepsilon$ works for all sequence lengths
Robust to individual token outliers (geometric mean shrinks large outliers)
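If the per-token log-ratios are only weakly dependent, $\log s_i$ (their average) concentrates around its mean as $|y_i|$ grows (a law-of-large-numbers argument, with CLT-style fluctuations of order $1/\sqrt{|y_i|}$ )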
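The difference between the arithmetic and geometric means under a single outlier token can be illustrated with a small, hypothetical response: 99 tokens whose ratio is exactly 1 and one token whose ratio jumps to 5. The numbers are made up for illustration.

```python
import math

def arithmetic_mean_ratio(ratios):
    """Arithmetic mean of per-token ratios (alternative B.2)."""
    return sum(ratios) / len(ratios)

def geometric_mean_ratio(ratios):
    """Geometric mean of per-token ratios (GSPO's s_i), via the mean log-ratio."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical response: 99 tokens with ratio 1.0 and one outlier token at 5.0.
ratios = [1.0] * 99 + [5.0]
am = arithmetic_mean_ratio(ratios)   # (99 + 5) / 100 = 1.04
gm = geometric_mean_ratio(ratios)    # 5 ** (1 / 100), about 1.016
print(f"arithmetic mean = {am:.4f}, geometric mean = {gm:.4f}")
```

The outlier moves the arithmetic mean more than twice as far from 1 as the geometric mean. One caveat: the geometric mean damps large ratios but is pulled down strongly by ratios near zero, so "robust to outliers" applies mainly to the upward direction.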

B.4 Clipping Based on KL Divergence

An alternative to clipping on the importance ratio would be clipping based on the KL divergence $\text{KL}[\pi_{\theta_\text{old}}(\cdot|x) || \pi_\theta(\cdot|x)]$ , as in the TRPO/PPO connection. However, computing the KL over the full response distribution requires marginalizing over all possible sequences, which is intractable. The sequence likelihood ratio serves as a practical proxy.

Appendix C: Practical Implementation Details

C.1 Computing $s_i$ in Practice

In PyTorch, given log-probabilities from the current and old policy:

import torch

# log_probs_new: shape [batch, seq_len]   (log pi_theta, requires grad)
# log_probs_old: shape [batch, seq_len]   (log pi_theta_old, stored at rollout time)
# mask:          shape [batch, seq_len]   (1.0 for response tokens, 0.0 for padding)
# A_hat:         shape [batch]            (group-normalized advantage per response)
# epsilon:       float                    (sequence-level clipping range)

# Per-token log-ratio; detach the old log-probs and zero out padding positions.
log_ratio = (log_probs_new - log_probs_old.detach()) * mask
seq_len = mask.sum(dim=-1).clamp(min=1)            # response length |y_i|, guarded against 0
avg_log_ratio = log_ratio.sum(dim=-1) / seq_len    # mean log-ratio per response
s_i = avg_log_ratio.exp()                          # sequence-level ratio, shape [batch]

# Clipped objective, negated so that it can be minimized as a loss.
clipped_s = s_i.clamp(1 - epsilon, 1 + epsilon)
loss = -torch.min(s_i * A_hat, clipped_s * A_hat).mean()

Key point: log_probs_old must act as a constant. It is normally computed without gradient tracking during the rollout; the explicit .detach() above guards against accidentally backpropagating through it.

C.2 Setting the Clipping Range $\varepsilon$

Because $s_i$ is the geometric mean of per-token ratios, and per-token ratios are close to 1 for small updates, $s_i$ will also be close to 1. But the “closeness” depends on learning rate and model scale. A practical heuristic:

Run a diagnostic rollout with the current hyperparameters and compute the distribution of $s_i$ values across a typical mini-batch.
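Compute a high percentile of $|s_i - 1|$ over that mini-batch (for example, the 90th percentile $\delta$ ) and set $\varepsilon \approx \delta$ , so that about 90% of responses lie inside the clip range and can participate fully in the update.
The paper says only that the clipping ranges of GSPO and GRPO “differ in order of magnitude.” The ranges $\varepsilon \approx 0.01$ – $0.05$ for GSPO vs. $\varepsilon \approx 0.1$ – $0.2$ for GRPO are this review’s rough guesses rather than values reported in the paper; check them against the paper’s experimental settings and your own diagnostics.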
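The percentile heuristic can be sketched as follows. This is this review's heuristic rather than a procedure from the paper; the function name `clip_range_from_diagnostic`, the nearest-rank quantile rule, and the sample $s_i$ values are all assumptions made for illustration.

```python
import math

def clip_range_from_diagnostic(seq_ratios, quantile=0.90):
    """Return the given quantile of |s_i - 1| as a candidate clipping range."""
    deviations = sorted(abs(s - 1.0) for s in seq_ratios)
    # Nearest-rank quantile; the small offset guards against float round-up in ceil.
    rank = max(1, math.ceil(quantile * len(deviations) - 1e-9))
    return deviations[rank - 1]

# Hypothetical s_i values from a diagnostic rollout of 10 responses.
diagnostic = [1.001, 0.998, 1.003, 0.999, 1.010, 0.996, 1.002, 1.000, 0.995, 1.020]
epsilon = clip_range_from_diagnostic(diagnostic, quantile=0.90)
outside = sum(abs(s - 1.0) > epsilon for s in diagnostic)
print(f"candidate epsilon = {epsilon:.4f}, responses outside range = {outside}")
```

For these sample values the candidate range is about 0.01, with one response out of ten falling outside it. The value is only a starting point: it depends on the learning rate and on how many gradient steps are taken per rollout, so it should be re-measured when either changes.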

C.3 Group Size Considerations

The group size $G$ affects the quality of the advantage estimates. Practical recommendations:

Setting	Typical $G$ range	Notes
Small models (7B)	4–8	Sufficient for typical reward variance
Large models (30B+)	8–16	Larger $G$ amortizes rollout cost
High-variance rewards (sparse)	16–32	Need more samples to estimate statistics reliably
Low-variance rewards (dense)	4–8	Group statistics are reliable with fewer samples

When $\text{std}(\{r_j\}) \approx 0$ (all responses receive the same reward), the advantage is numerically unstable. A common fix is to add a small constant $\epsilon_\text{std}$ to the denominator: $\hat{A}_i = (r_i - \mu) / (\sigma + \epsilon_\text{std})$ .
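A minimal sketch of this normalization, assuming the population standard deviation (some codebases use the sample standard deviation instead) and a hypothetical helper name `group_advantages`:

```python
import statistics

def group_advantages(rewards, eps_std=1e-6):
    """Group-normalized advantages: (r_i - mean) / (std + eps_std)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps_std) for r in rewards]

# A group with mixed outcomes: advantages are close to +1 and -1.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# A degenerate group where every response gets the same reward:
# the numerator is exactly 0, so every advantage is 0 instead of NaN.
degenerate = group_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(degenerate)
```

With the added constant, a group in which all responses receive the same reward yields zero advantages, so that group simply contributes no gradient. Such groups still consume rollout compute, which is one reason the fraction of degenerate groups is worth monitoring.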

Appendix D: Relation to Other Recent RL Algorithms for LLMs

The landscape of LLM RL training algorithms has evolved rapidly. Here is where GSPO fits:

REINFORCE / REINFORCE with baseline: The classic algorithm. Updates the policy by $\nabla_\theta \log\pi_\theta(y|x) \cdot r(x,y)$ , or with a baseline subtracted. No importance sampling, so no reuse of rollout data across mini-batches. High variance.

PPO: Adds importance sampling for multi-step updates and clip for stability. Requires a value model. The gold standard for RLHF but expensive at scale.

GRPO (DeepSeek, 2024): Removes the value model, uses group reward normalization. Fast and practical, but unstable for large models due to token-level importance ratio misapplication.

GSPO (Qwen, 2025): Fixes GRPO’s importance ratio to sequence-level. Stable, value-model-free, practical.

DAPO / VINO / other variants: Several variants have emerged that modify the advantage estimation, reward normalization, or objective structure. GSPO’s sequence-level ratio is orthogonal to most of these and could potentially be combined with them.

Figure 6: Timeline of LLM RL Algorithms

2017 │  PPO (Schulman et al.)
     │  ↓ token-level ratio, value model, stable
2022 │  InstructGPT (OpenAI): PPO for RLHF, large scale
     │
2024 │  GRPO (DeepSeekMath): value-model-free, but unstable at scale
     │  ↓ token-level ratio, group advantage
     │
2025 │  DeepSeek-R1: uses GRPO for reasoning, triggers wide adoption
     │  but also reveals stability problems at larger scales
     │
     │  GSPO (Qwen Team): sequence-level ratio, stable for large MoE
     │  ↓ deployed in Qwen3 production training
     │
     │  [Many variants: DAPO, VINO, etc.; ongoing active area]

The trend is clear: the field is moving from PPO (accurate but expensive) through GRPO (cheap but unstable) toward algorithms like GSPO that are cheap and stable. The remaining open question is whether stable value-model-free algorithms can also provide reliable advantage estimates for tasks with complex credit assignment.

Appendix E: Reproducing GSPO — Practical Checklist

For practitioners who want to implement GSPO:

Data: Collect rollouts from the old policy $\pi_{\theta_\text{old}}$ . Store log-probabilities of each generated token under $\pi_{\theta_\text{old}}$ (to be used as the denominator in $s_i$ ).
Reward function: Must produce a scalar $r(x, y) \in \mathbb{R}$ per response. Binary rewards (0/1 for correctness) and continuous rewards both work. Ensure rewards are bounded or normalized to prevent advantage explosion.
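Group construction: For each query $x$ , generate $G$ responses and record which group each response belongs to. Responses can be shuffled when splitting into mini-batches, but the group advantages (next item) must be computed per group before shuffling; shuffling itself does not change the advantage estimates.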
Advantage normalization: Compute mean and std of rewards within each group of $G$ responses. Clip std from below at a small $\epsilon_\text{std}$ (e.g., $10^{-6}$ ) to prevent division by zero.
Sequence ratio computation: For each response $y_i$ in a mini-batch, compute $s_i$ using the current policy $\pi_\theta$ ’s log-probs and the stored $\pi_{\theta_\text{old}}$ log-probs. Apply $1/|y_i|$ normalization.
Clipping: Apply clip $(s_i, 1-\varepsilon, 1+\varepsilon)$ . Start with $\varepsilon \approx 0.02$ – $0.05$ and tune based on clip rate diagnostics.
Objective and gradient: $J_i = \min(s_i \hat{A}_i, \text{clip}(s_i, 1-\varepsilon, 1+\varepsilon)\hat{A}_i)$ . Average over responses in the mini-batch and maximize the result (equivalently, minimize its negative as the loss).
Update $\pi_{\theta_\text{old}}$ : After completing all mini-batch updates for a rollout, copy current $\theta$ to $\theta_\text{old}$ before the next rollout.
Monitoring: Track the distribution of $s_i$ values per step (should stay near 1), the clip rate (what fraction of responses are clipped, should be 5–20%), and the reward improvement per rollout step.
Instability signals: If the average response length explodes (model starts generating very long responses to avoid low rewards) or reward suddenly drops to near-zero, check whether $\varepsilon$ needs adjustment and whether the old log-probs are being stored correctly.
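The monitoring item in the checklist can be sketched as a small diagnostic. The function name `clip_diagnostics` and the sample values are hypothetical. The sketch counts a response as clipped only when the clipped branch of the min is actually selected, which for a PPO-style objective happens when the ratio has already moved in the direction the advantage favors.

```python
def clip_diagnostics(seq_ratios, advantages, epsilon):
    """Fraction of responses on the clipped (zero-gradient) branch, plus mean |s_i - 1|."""
    clipped = 0
    for s, a in zip(seq_ratios, advantages):
        # min(s*A, clip(s)*A) selects the clipped term only if
        # A > 0 and s > 1 + epsilon, or A < 0 and s < 1 - epsilon.
        if (a > 0 and s > 1 + epsilon) or (a < 0 and s < 1 - epsilon):
            clipped += 1
    n = len(seq_ratios)
    return {
        "clip_rate": clipped / n,
        "mean_abs_dev": sum(abs(s - 1.0) for s in seq_ratios) / n,
    }

# Hypothetical mini-batch of four responses (illustration only).
stats = clip_diagnostics(
    seq_ratios=[1.08, 0.97, 0.92, 1.08],
    advantages=[1.0, -0.5, -1.0, -1.0],
    epsilon=0.05,
)
print(stats)
```

In this sample, the first and third responses are on the clipped branch, so the clip rate is 0.5. The fourth response lies outside the range but has a negative advantage, so its gradient still flows and it is not counted.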

Appendix F: Deep-Dive — Understanding the Clip Mechanism Through a Worked Example

To make the clip mechanics concrete, consider a mini-batch with three responses to the same query $x$ , with group size $G = 3$ .

Setup:

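| Response | Length $\lvert y_i \rvert$ | Reward $r_i$ | Advantage $\hat{A}_i$ |
|---|---|---|---|
| $y_1$ | 200 | 1.0 | +1.225 |
| $y_2$ | 180 | 0.5 | 0.000 |
| $y_3$ | 250 | 0.0 | −1.225 |

(Advantage computed with the population standard deviation: mean = 0.5, std = $\sqrt{(0.25 + 0 + 0.25)/3} \approx 0.408$ , so $\hat{A}_1 = (1.0-0.5)/0.408 \approx +1.225$ , $\hat{A}_2 = 0$ , and $\hat{A}_3 = (0.0-0.5)/0.408 \approx -1.225$ .)

After one gradient step, suppose the sequence ratios are:

$s_1 = 1.04$ (current policy is slightly more likely to produce $y_1$ )
$s_2 = 0.97$ (slightly less likely to produce $y_2$ )
$s_3 = 1.08$ (notably more likely to produce $y_3$ , bad!)

With $\varepsilon = 0.05$ , the clip range is $[0.95, 1.05]$ :

Response	$s_i$	$\hat{A}_i$	$s_i \hat{A}_i$	$\text{clip}(s_i)\hat{A}_i$	$\min(\cdot)$
$y_1$	1.04	+1.225	+1.274	$(1.04)(+1.225) = +1.274$	+1.274
$y_2$	0.97	0.000	0.000	$(0.97)(0.000) = 0.000$	0.000
$y_3$	1.08	−1.225	−1.323	$(1.05)(-1.225) = -1.286$	−1.323

For $y_1$ : $s_1 = 1.04 < 1+\varepsilon = 1.05$ , so the clip is not active. The clipped and unclipped terms coincide and the full gradient flows.

For $y_2$ : the reward equals the group mean, so $\hat{A}_2 = 0$ and the response contributes nothing to the update, whatever its ratio.

For $y_3$ : $s_3 = 1.08 > 1+\varepsilon = 1.05$ , and $\hat{A}_3 < 0$ . The policy has moved in the wrong direction for a bad response (making $y_3$ more likely). The clipped term is $(1.05)(-1.225) = -1.286$ , but the min selects the smaller, unclipped term $-1.323$ . The clip is therefore not active here: the full gradient flows and pushes $s_3$ back down.

Clipping becomes active only in the opposite cases. If $s_1$ had risen above $1.05$ (with $\hat{A}_1 > 0$ ), the min would select the constant $(1.05)\hat{A}_1$ , which has zero gradient. Likewise, if $s_3$ had fallen below $0.95$ (with $\hat{A}_3 < 0$ ), the min would select the constant $(0.95)\hat{A}_3$ .

This worked example shows the clip’s conservative behavior: it removes the incentive to keep moving a response’s likelihood in the direction its advantage already favors once the ratio leaves $[1-\varepsilon, 1+\varepsilon]$ , but it never blocks an update that corrects a move in the wrong direction.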

In GRPO, this same calculation would be done independently at each of the 200–250 token positions within each response, with different $w_{i,t}$ values at each position. Some tokens in $y_3$ might have $w_{i,t} < 1-\varepsilon$ (clipped to a floor) while others have $w_{i,t}$ near 1 (contributing full gradient). The result is that $y_3$ receives a noisy, partially clipped gradient signal that is not coherently tied to the sequence-level reward.
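The arithmetic of the sequence-level example can be recomputed with a short script. This is a sketch under stated assumptions: advantages use the population standard deviation with no added constant (so the group must not have identical rewards), and the function name `gspo_terms` is hypothetical.

```python
import statistics

def gspo_terms(rewards, seq_ratios, epsilon):
    """Per-response (advantage, s*A, clip(s)*A, min of the two) for one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; assumes rewards are not all equal
    rows = []
    for r, s in zip(rewards, seq_ratios):
        adv = (r - mean) / std
        clipped_s = min(max(s, 1 - epsilon), 1 + epsilon)  # clip(s, 1-eps, 1+eps)
        unclipped_term = s * adv
        clipped_term = clipped_s * adv
        rows.append((adv, unclipped_term, clipped_term, min(unclipped_term, clipped_term)))
    return rows

rows = gspo_terms(rewards=[1.0, 0.5, 0.0], seq_ratios=[1.04, 0.97, 1.08], epsilon=0.05)
for i, (adv, unclipped, clipped, chosen) in enumerate(rows, start=1):
    print(f"y_{i}: A={adv:+.3f}  s*A={unclipped:+.3f}  clip(s)*A={clipped:+.3f}  min={chosen:+.3f}")
```

Printing the four terms per response makes it easy to see which branch of the min is selected, and therefore whether a given response still contributes a gradient.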

Appendix G: Open Research Directions

GSPO opens several research questions that the paper does not address:

1. Theoretical convergence: Can we prove a monotone improvement theorem analogous to TRPO for GSPO? The sequence-level importance ratio is a principled importance sampling weight, but the clip mechanism breaks the theoretical guarantees from the TRPO/PPO derivation. A trust-region style analysis for sequence-level objectives would strengthen the theoretical foundation.

2. Adaptive clipping: The fixed $\varepsilon$ may not be optimal throughout training. Early in training, large policy updates may be desirable; later, tighter clipping prevents overfitting. An adaptive $\varepsilon$ schedule (similar to learning rate warmup/cooldown) could improve training dynamics.

3. Credit assignment in long reasoning chains: When a 2000-token chain-of-thought contains several distinct reasoning steps, the group reward (based on final correctness) assigns the same advantage to all tokens. A finer credit assignment — perhaps combining GSPO’s sequence-level ratio with process reward models that score intermediate reasoning steps — could unlock better sample efficiency.

4. GSPO for multi-agent RL: In multi-agent settings, responses may involve multiple agents collaborating. The sequence-level importance ratio may need to be extended to handle partial sequences, role-conditioned responses, or hierarchical reward structures.

5. Extension to continuous action spaces: GSPO is designed for discrete autoregressive generation. For language model policies over continuous spaces (e.g., direct parameter optimization, or continuous action environments), the sequence-level ratio formulation would need to be adapted.

These directions suggest that GSPO is not just an incremental fix to GRPO, but a conceptual shift in how we think about importance sampling for language model RL — one that will likely propagate into future algorithm designs.

Appendix I: Key Takeaways for Practitioners

If you are training an LLM with RL and experiencing instability:

Switch from token-level to sequence-level importance ratios (GSPO’s core fix)
Lower your clip range $\varepsilon$ significantly compared to GRPO settings
Monitor clip rate per step; aim for 5–20% of responses being clipped

If you are reading GSPO for research purposes:

The core contribution is the observation that reward granularity should match importance ratio granularity
The gradient analysis (Eqs. 10 vs. 12) is the key technical contribution
The lack of ablations and narrow experimental scope are legitimate concerns for follow-up work

If you are building an RL training stack from scratch:

GSPO is simpler to implement than PPO (no value model)
More stable than GRPO for large-scale MoE training
Key engineering: correctly store and reuse old policy log-probs across mini-batches
Consider GSPO-token for multi-turn RL where per-token advantages are needed