Review date: 2026-06-13 Review author: Zhongzhu Zhou Paper reviewed: Harnessing Routing Foresight for Micro-step-level MoE Load Balancing in RL Post-training Paper authors: Yuming Zhou, Haoyang Li, Sheng Lin, Yanfeng Zhao, Tong Zhao, Xupeng Miao, Jie Jiang, Fangcheng Fu, Bin Cui arXiv: 2606.11867v1, 2026-06-11 Venue/status: arXiv preprint (Peking University / Shanghai Jiao Tong University / Tencent)
Short Answer
ForeMoE identifies and solves a real but overlooked problem: expert load imbalance during reinforcement learning post-training of Mixture-of-Experts (MoE) models operates at a fundamentally different timescale than during pre-training, and all existing load-balancing methods fail to account for this difference.
The root cause is subtle. Pre-training uses large, diverse batches; the routing distribution across experts is unpredictable before each step, so prior systems estimate it from historical statistics. RL post-training, by contrast, focuses on concentrated task domains (mathematics, coding) with a model whose experts are already specialized. The result is that the step-level routing pattern is stable and predictable — but each gradient-accumulation micro-step processes only a handful of samples, destroying the statistical averaging that made step-level statistics reliable. Expert loads fluctuate wildly micro-step to micro-step, yet existing systems have no mechanism to detect or respond to this fine-grained variation.
ForeMoE’s key insight is that RL post-training has a structural property that makes micro-step load balancing tractable: router replay. The rollout stage generates responses and records each token’s expert assignments. Because the recompute and policy update stages must use exactly the same routing decisions (for correctness of importance sampling), the routing for every micro-step in those stages is fully known before they execute. This converts the load balancing problem from a prediction problem into a planning problem.
Building on this, ForeMoE introduces a Four-Stage Planner that decomposes the NP-hard load balancing optimization into tractable sub-problems, and an Expert Transfer Engine that exploits complementary hardware paths (CPU-assisted PCIe for recompute, GPU-direct NVLink for policy update) to overlap expert migrations with computation. On 64 GPUs training Qwen3-30B-A3B, ForeMoE achieves up to 1.45× end-to-end speedup over state-of-the-art RL training systems.
1. Prerequisites
ForeMoE sits at the intersection of MoE architecture, distributed training, and RL post-training pipelines. I cover each area thoroughly below.
1.1 Transformer Architecture and Feed-Forward Networks
The standard Transformer has alternating multi-head self-attention and feed-forward network (FFN) sublayers. For a hidden state , the FFN is:
where , , and is an activation function (GELU, SiLU, ReLU). The FFN accounts for roughly two-thirds of total parameter count in large Transformers, and it is applied independently and identically to every token position. For models with billions of parameters, this is computationally expensive — but note that all tokens undergo the same computation regardless of content.
1.2 Mixture-of-Experts (MoE) Architecture
The MoE architecture (Shazeer et al., 2017; Lepikhin et al., 2021 — GShard) replaces the monolithic dense FFN with parallel expert networks and a lightweight router :
Router mechanics: the router is a linear layer producing logits . The gating weights are computed by applying softmax to the top- logits:
For tokens not assigned to expert (i.e., ), and the expert is not executed. With and , only 2 out of 64 experts activate per token — compute is roughly of dense equivalent per layer.
Token capacity constraint: in hardware, each expert can process at most tokens per micro-step (where is batch size). Tokens assigned to overloaded experts beyond capacity are dropped or re-routed. This capacity constraint interacts with load imbalance: popular experts hit capacity first, forcing token drops that degrade model quality.
Modern MoE examples:
- DeepSeek-V3/R1: 671B total, 37B active per token; 256 experts per MoE layer, top-8 routing
- Qwen3-30B-A3B: 30B total, ~3B active; 64 experts per layer, top-2 routing (used in ForeMoE experiments)
- Kimi K2.5: MoE model from Moonshot AI, used in production RL post-training
1.3 Expert Parallelism (EP)
When training on devices, Expert Parallelism partitions the experts across ranks so rank hosts experts . Processing a MoE layer requires two All-to-All collective communications:
Dispatch: each device sends tokens to the device hosting their assigned experts. Combine: each device returns the expert outputs to the original token-owner devices.
Token flow through MoE with Expert Parallelism (E=4, R=2, K=1):
Rank 0 tokens: [T0→E2, T1→E0, T2→E3, T3→E0]
Rank 1 tokens: [T4→E1, T5→E3, T6→E0, T7→E2]
Dispatch All-to-All:
Rank 0 → Rank 0: T1, T3 (assigned to E0 on Rank 0)
Rank 0 → Rank 1: T0, T2 (assigned to E2,E3 on Rank 1)
Rank 1 → Rank 0: T6 (assigned to E0 on Rank 0)
Rank 1 → Rank 1: T4,T5,T7 (assigned to E1,E3,E2 on Rank 1)
Expert computation (parallel on both ranks)
Combine All-to-All: reverse of dispatch
The All-to-All latency is , which is dominated by the most overloaded rank. If one rank receives 3× the average token count, the entire All-to-All is delayed by 3×.
1.4 Expert Load Imbalance: Formal Definition
Define the load of rank during micro-step as:
The ideal load per rank is:
where is the number of tokens in micro-step . The imbalance ratio measures how far reality departs from ideal balance; is perfect, means the bottleneck rank processes twice the average.
The makespan of a micro-step is proportional to (in the absence of load balancing). A 40% load imbalance () means the All-to-All takes 40% longer than necessary, blocking all ranks until the overloaded one finishes.
1.5 Pre-training Load Balancing: Existing Approaches
Prior systems for pre-training load imbalance fall into two categories:
Expert relocation: move expert from rank to rank (copy weights over interconnect, update routing table). Reduces load on , increases load on .
Expert replication: create an additional copy of expert on rank . Tokens assigned to can now be processed by either the original or the replica, distributing load across both ranks. Increases memory usage but can reduce the peak load on the original rank.
Both mechanisms require predicting upcoming load before a step executes. During pre-training, routing decisions are unknown until execution begins, so systems estimate future loads from an exponential moving average (EMA) of past steps:
This works in pre-training because the routing distribution is approximately stationary over large diverse batches.
1.6 RL Post-training Pipeline
Reinforcement learning post-training fine-tunes a pre-trained model to optimize a scalar reward where is a prompt and is the model’s response. The dominant algorithms (PPO, GRPO, DAPO) all share a three-stage pipeline per training step:
Stage 1 — Rollout: the current policy generates responses for a batch of prompts through autoregressive decoding. Crucially, the routing decisions for every token at every layer are recorded:
where is the -th expert assigned to token at layer and the corresponding gating weight. This routing table completely determines future expert loads.
Stage 2 — Recompute: the training framework recomputes log-probabilities via a forward pass. This recomputation is necessary because:
- The rollout inference engine (vLLM/SGLang) uses a different parallelization strategy than the training framework (Megatron, FSDP)
- Gradient computation requires per-token log-probabilities aligned with the training framework’s tensor layout
- Importance sampling for GRPO requires an accurate reference log-probability
Due to GPU memory limits, the sequences are processed in micro-steps of tokens each (where is total batch size), accumulating contributions to the policy gradient.
Stage 3 — Policy Update: compute gradients of the policy objective (e.g., GRPO):
Again processed in micro-steps for memory efficiency. Gradients are accumulated across micro-steps and applied once at the end.
Router Replay (key constraint): the recompute and policy update stages must use exactly the same token-to-expert routing that was recorded during rollout. This constraint exists because:
- The log-probabilities in the recompute stage must correspond to the same computational path as rollout
- If routing changed between rollout and recompute, the log-probs would be computed by different experts, invalidating the importance sampling estimate
This “router replay” constraint means the routing table is fully fixed before stages 2 and 3 begin — every token knows in advance which expert it will visit in every layer, in every micro-step.
2. The Load Imbalance Problem in RL Post-training
2.1 Step-Level vs Micro-Step-Level Dynamics
ForeMoE’s central empirical observation (validated on Qwen3-30B-A3B) has two components:
Step-level load is stable and skewed. Because RL post-training targets concentrated domains (mathematics or coding), and expert specialization is already established from pre-training, the aggregate routing distribution across a full step is remarkably stable. Mathematically, if is the fraction of tokens that expert receives at step , then across most training steps. However, the distribution is not uniform: popular math/code experts receive 2–5× more tokens than average.
Micro-step-level load is highly variable. Within a step, sequences are split into micro-steps. Each micro-step contains tokens. With small (say, 512 tokens across 64 experts), the per-expert count follows:
The coefficient of variation (CV = std/mean) for a Poisson is . For a popular expert with and , :
This 14% CV means the per-expert token count fluctuates by ±14% (1σ) from its mean — but across all ranks, the maximum imbalance ratio can be substantially higher, especially for the most popular experts.
2.2 Why Pre-training Methods Fail
Three compounding failure modes:
-
EMA cannot track micro-step variance. EMA estimates (the step-level mean). It correctly captures the stable, skewed step-level distribution but completely ignores micro-step deviations . The EMA-based load balancing plan is computed against , not against the actual micro-step load — so even with perfect relocation, residual micro-step imbalance persists.
-
No latency budget for planning. Pre-training load balancing triggers every steps (e.g., ), giving ample time (~seconds) to solve the optimization. Micro-step balancing must plan and execute in the gap between micro-steps — often tens of milliseconds. The full NP-hard optimization is intractable in this window.
-
Expert transfer frequency is too high. In pre-training, an expert relocation happens once per steps; its cost is amortized across many forward passes. Micro-step balancing may require a different placement for every micro-step, meaning expert weights must be transferred at O(1/micro-step time) frequency. Without efficient hardware paths, transfer overhead exceeds the benefit.
Figure 2 illustrates the timing constraints:
sequenceDiagram
participant Rollout as Rollout (Stage 1)
participant Planner as ForeMoE Planner
participant Transfer as Expert Transfer Engine
participant Recompute as Recompute (Stage 2)
participant Policy as Policy Update (Stage 3)
Rollout->>Planner: Send routing table R
Planner->>Planner: Stage 1: BPO (base placement)
loop For each micro-step i
Planner->>Planner: Stage 2: MDA (micro-step adj)
Planner->>Transfer: Send placement plan A[i]
Transfer->>Recompute: Execute expert transfers
Recompute->>Planner: Micro-step i done
end
loop For each micro-step j
Planner->>Planner: Stage 2: MDA (policy update)
Planner->>Transfer: Send placement plan A[j]
Transfer->>Policy: GPU-direct transfers
Policy->>Planner: Micro-step j done
end
Figure 2: ForeMoE’s interaction with the RL post-training pipeline. The routing table from rollout drives all subsequent load balancing decisions. The planner handles both recompute and policy update stages with stage-appropriate transfer strategies.
3. ForeMoE System Architecture
3.1 The Router Replay Opportunity
Router replay is a correctness constraint in RL frameworks like VERL and AceMo: recompute and policy update must replay the exact expert assignments from rollout. ForeMoE reframes this constraint as an opportunity:
Given the routing table (from rollout), we know exactly how many tokens each expert will receive in each micro-step of both recompute and policy update, before any of those micro-steps execute.
This is the difference between prediction and planning. Pre-training load balancing must predict an uncertain future; ForeMoE plans against a certain future. This makes it possible to compute a provably optimal (within planner approximation) placement for every micro-step before the micro-step runs.
3.2 Problem Formulation
Notation:
- experts indexed ; ranks indexed ; micro-steps indexed
- Expert memory cost (in GPU memory units); per-rank memory budget
- From routing table: = number of tokens assigned to expert in micro-step
Decision variables:
- : expert is relocated to rank (primary location)
- : expert is replicated on rank (additional copy)
- : in micro-step , rank handles the tokens for expert
Objective — minimize makespan:
Constraints:
Memory:
Coverage (each expert on at least one rank):
Routing feasibility (can only send tokens to ranks hosting the expert):
Token assignment completeness (all tokens for expert in micro-step must be processed):
NP-hardness: proven by reduction from bin packing (see §5). In practice, with experts across ranks, the placement space has configurations even without replication.
3.3 The Four-Stage Planner
The planner exploits two structural properties to decompose the NP-hard problem:
Property A (step stability): is approximately the same as (previous step’s mean). So a base placement computed against remains valid across steps.
Property B (deviation centering): micro-step deviations are zero-mean by construction , and small in magnitude relative to .
Decomposition: split the placement into a stable base component (solved once per step) and a micro-step adjustment component (solved per micro-step, with restricted search space).
Stage 1 — Base Placement Optimizer (BPO)
Algorithm 1: Base Placement Optimizer (BPO)
─────────────────────────────────────────────────────────────────────
Input: step-level mean expert token counts {c̄_e} (from routing table),
previous placement P_prev (warm start),
memory budget C per rank, replication budget delta_C
Output: base placement P* = {(x_{e,r}, y_{e,r})} for all e, r
1. P = P_prev (warm start from previous step)
2. Compute per-rank mean load: load(r, P) = sum_{e: x_{e,r}=1} c̄_e / R
3. ideal = total_tokens * K / R (tokens per rank at perfect balance)
4. RELOCATION PHASE:
Sort experts by |load(host(e), P) - ideal| descending
For each expert e in sorted order:
r_current = host(e, P)
For each alternative rank r' ≠ r_current (sorted by load asc):
If memory_ok(r', e, P) and load(r', P) + c̄_e < load(r_current, P):
Relocate e to r': P = update(P, x_{e,r_current}=0, x_{e,r'}=1)
Update load(r_current), load(r')
Break
5. REPLICATION PHASE:
Identify overloaded ranks O = {r : load(r,P) > (1+eps) * ideal}
For each r_o in O, sorted by load(r_o) desc:
e* = argmax_e c̄_e such that host(e*) = r_o
For each rank r_u with memory headroom (C - used_mem(r_u) >= m_{e*}):
If load(r_u) < (1-eps) * ideal:
Replicate e* on r_u: P = update(P, y_{e*,r_u}=1)
Load(r_u) += c̄_{e*} / 2 (tokens split between original and replica)
Load(r_o) -= c̄_{e*} / 2
Break (one replica per overloaded expert)
6. Return P*
─────────────────────────────────────────────────────────────────────
Stage 2 — Micro-step Deviation Adjuster (MDA)
Algorithm 2: Micro-step Deviation Adjuster (MDA)
─────────────────────────────────────────────────────────────────────
Input: base placement P*, micro-step routing sub-table R[m]
(exact token counts c_{e,m} for this micro-step),
deviation threshold eps
Output: adjustment plan A[m] = list of (expert, src_rank, dst_rank)
1. Compute actual per-rank load for this micro-step:
L_r = sum_{e: on_rank(e,r,P*)} c_{e,m} for each r
2. Identify overloaded ranks: O = {r : L_r > (1+eps) * ideal_m}
Identify underloaded ranks: U = {r : L_r < (1-eps) * ideal_m}
(ideal_m = B_m * K / R)
3. If O is empty: A[m] = {} (no adjustment needed); return
4. For each (r_o, r_u) pair in O × U, sorted by (L_{r_o} - L_{r_u}) desc:
e* = expert on r_o with highest c_{e*,m} that can fit in r_u memory
excess = L_{r_o} - ideal_m
If c_{e*,m} <= excess * 1.5: // relocation would not over-correct
Add (e*, r_o, r_u) to A[m]
Update L_{r_o} -= c_{e*,m}
Update L_{r_u} += c_{e*,m}
If r_o no longer overloaded: break
5. Return A[m]
─────────────────────────────────────────────────────────────────────
Stage 3 — Token Reassignment Validator: verifies that the proposed adjustment plan A[m] is consistent with the routing table. Specifically, tokens assigned to a relocated expert must be dispatched to the expert’s new location. This stage checks that the adjusted placement does not violate the router replay constraint.
Stage 4 — Load Verification: computes the final imbalance ratio under the adjusted placement. If exceeds a target threshold, the planner can attempt additional adjustments or fall back to the base placement (accepting suboptimal balance to avoid transfer overhead).
Figure 3 shows the planner data flow:
flowchart LR
subgraph Rollout
RTab["Routing Table R\n(all micro-steps)"]
end
subgraph Planner
BPO["Stage 1: BPO\n(once/step)\nc̄_e → P*"]
MDA["Stage 2: MDA\n(per micro-step)\nR[m] + P* → A[m]"]
TV["Stage 3: Token\nValidator\ncheck consistency"]
LV["Stage 4: Load\nVerifier\ncheck ρ"]
end
subgraph Transfer
TX["Expert Transfer\nEngine\nexecute A[m]"]
end
subgraph Execute
MS["Micro-step m\n(MoE forward/backward)"]
end
RTab --> BPO
RTab --> MDA
BPO --> MDA
MDA --> TV
TV --> LV
LV --> TX
TX --> MS
MS --> MDA
Figure 3: ForeMoE Four-Stage Planner. The routing table from rollout feeds both the BPO (once per step) and the per-micro-step MDA. The output adjustment plan is validated and verified before being sent to the Expert Transfer Engine.
3.4 Expert Transfer Engine
The transfer engine must execute expert weight migrations fast enough that they complete before the micro-step’s compute phase begins. The key innovation is using different hardware paths for different RL stages:
graph TB
subgraph RecomputeStage["Recompute Stage — CPU-Assisted Path"]
GPU0R["GPU 0\n(source expert)"]
CPU0["CPU DRAM 0"]
CPU1["CPU DRAM 1"]
GPU1R["GPU 1\n(dest rank)"]
GPU0R -- "D2H DMA\n(PCIe, ~32 GB/s)" --> CPU0
CPU0 -- "CPU network\n(via RDMA NIC)" --> CPU1
CPU1 -- "H2D DMA\n(PCIe, ~32 GB/s)" --> GPU1R
end
subgraph PolicyUpdateStage["Policy Update Stage — GPU-Direct Path"]
GPU0P["GPU 0\n(source expert)"]
GPU1P["GPU 1\n(dest rank)"]
GPU0P -- "NVLink 4\n(~600 GB/s)" --> GPU1P
end
Figure 4: Expert Transfer Engine uses complementary hardware paths. Recompute uses CPU-assisted PCIe (GPU memory → CPU DRAM → remote CPU DRAM → GPU memory), overlapping with GPU compute. Policy update uses GPU-direct NVLink (direct GPU-to-GPU), providing 10–20× higher bandwidth for time-critical gradient synchronization.
Why CPU-assisted for recompute? During recompute, the GPU is executing MoE forward pass computation. CPU-assisted DMA transfers proceed in parallel through the PCIe bus without interrupting GPU execution. The expert weight size for Qwen3-30B-A3B is approximately per expert. At 32 GB/s PCIe, transferring one expert takes ~0.4ms — well within the 80–120ms micro-step compute window.
Why GPU-direct for policy update? Policy update involves gradient accumulation with frequent synchronization points. CPU-assisted transfers during gradient computation would introduce PCIe round-trips that add latency to the critical path. GPU-direct NVLink transfers (10–20× faster than PCIe) complete in ~20µs for a 12.5 MB expert, fitting within inter-operation gaps.
Overlap pipeline: the transfer engine asynchronously prepares the placement for micro-step while micro-step is executing:
Timeline (two consecutive micro-steps):
t=0: Micro-step m begins compute
t=0: MDA runs for micro-step m+1 (async, on CPU)
t=5ms: Transfer begins for experts needed in micro-step m+1
t=~85ms: Micro-step m compute completes
t=~85ms: Transfer for micro-step m+1 done (started at t=5ms, ~80ms budget)
t=85ms: Micro-step m+1 begins with correct placement
If the transfer fits within the compute window (the common case), the effective overhead of load balancing is zero — the transfers run for free in the background.
4. Mathematical Analysis: Load Imbalance Model
4.1 Expected Makespan Without Load Balancing
For each micro-step , the token count for expert is:
where is the long-term routing probability for expert . The load on rank is:
Under independent Poisson approximation (valid when ):
Define (the “routing mass” assigned to rank ). The expected makespan per micro-step is:
For perfectly balanced routing ( for all ) with :
This is the baseline. With imbalanced routing ( varies), the dominant rank’s mean shifts to , worsening the makespan.
4.2 Benefit of Micro-step Load Balancing
After MDA adjustment, the effective load variance is reduced. The residual imbalance comes from experts that could not be relocated (due to memory constraints) or replicated (insufficient headroom). If ForeMoE can relocate/replicate experts to achieve near-uniform load, the makespan approaches:
where overhead is the expert transfer time. The 1.45× speedup in practice corresponds to reducing the effective from ~0.22 (8% imbalance on a 8-rank setup with popular experts) to ~0.135 (≈1/R = 0.125).
5. NP-Hardness: Formal Reduction
Theorem (ForeMoE §7): The load balancing decision problem — “does there exist a placement with imbalance ratio ?” — is NP-complete.
Proof sketch (reduction from 3-Dimensional Matching / Bin Packing):
Given a bin packing instance with items (positive integers) and bin capacity , construct:
- experts with memory cost
- ranks with budget
- Single micro-step () with token count for expert
- Target imbalance ratio (perfect balance)
A placement with exists iff:
This is exactly a balanced bin packing problem, which is NP-complete. Since the decision problem reduces to NP-complete, the optimization is NP-hard.
Practical implication: for the Qwen3-30B-A3B setup (64 experts, 8 ranks), a brute-force optimal solution would require evaluating placements per micro-step. The hierarchical BPO+MDA decomposition reduces this to greedy steps for BPO plus candidate experts for MDA.
6. Experiments
6.1 Experimental Setup
Cluster: 64 NVIDIA GPUs (H100-class, NVLink 4), Megatron-LM training framework, vLLM rollout engine, GRPO-based RL objective
Model: Qwen3-30B-A3B (30B parameters, ~3B active; 64 experts per MoE layer; top-2 routing; 8 expert parallelism ranks)
Micro-step configuration: global batch = 256 sequences; micro-steps per step; sequences × ~512 tokens each
RL datasets:
- DAPO-Math-17k: 17,000 math problems (concentrated domain → high expert specialization)
- CodeForces: competitive programming problems (concentrated domain → different specialization pattern)
Baselines:
- No LB: vanilla MoE training without any load balancing
- EMA LB: step-level EMA prediction with greedy relocation (representative SOTA pre-training LB)
- Oracle: optimal placement with foreknowledge of micro-step loads but no latency constraint
- ForeMoE: proposed system (BPO + MDA + Expert Transfer Engine)
6.2 Main Speedup Results
| Method | DAPO-Math | CodeForces | GPU Util. | Peak Imbal. |
|---|---|---|---|---|
| No LB | 1.00× | 1.00× | 62% | 2.3× |
| EMA LB | 1.08× | 1.06× | 67% | 1.88× |
| ForeMoE | 1.45× | 1.38× | 88% | 1.21× |
| Oracle | 1.52× | 1.46× | 92% | 1.08× |
ForeMoE reaches 95% of oracle speedup on DAPO-Math (1.45 vs 1.52), validating the hierarchical planner’s near-optimality. The EMA baseline’s 8% improvement confirms that even step-level corrections help, but they miss the bulk of the micro-step imbalance.
End-to-End Speedup on DAPO-Math-17k (64 GPUs, relative to No LB)
1.6× ┤
1.5× ┤ ██████ 1.52×
1.4× ┤ ██████ 1.45×
1.3× ┤
1.2× ┤
1.1× ┤ ██████ 1.08×
1.0× ┤ (baseline)
└──────────────────────────────────────────
No LB EMA LB ForeMoE Oracle
ForeMoE / Oracle gap: 0.07× (95% efficiency)
ForeMoE / EMA gap: 0.37× (+34% improvement)
Figure 5: End-to-end speedup comparison on DAPO-Math-17k with 64 GPUs. ForeMoE achieves 1.45× over unoptimized training, reaching 95% of oracle. The gap to EMA LB is 37 percentage points, showing how much micro-step imbalance was missed.
6.3 Load Imbalance Distribution
The paper provides a histogram of imbalance ratios across micro-steps:
| Imbalance Range | No LB (% of micro-steps) | EMA LB | ForeMoE |
|---|---|---|---|
| ρ < 1.1 | 3% | 9% | 61% |
| 1.1 ≤ ρ < 1.3 | 12% | 21% | 32% |
| 1.3 ≤ ρ < 1.5 | 18% | 24% | 6% |
| 1.5 ≤ ρ < 2.0 | 34% | 31% | 1% |
| ρ ≥ 2.0 | 33% | 15% | 0% |
ForeMoE concentrates 93% of micro-steps below ρ = 1.3, compared to 15% for no LB. High-imbalance events (ρ ≥ 2.0) are eliminated entirely.
6.4 Component Ablation
| Ablation | DAPO-Math Speedup | CodeForces Speedup |
|---|---|---|
| BPO only (no MDA) | 1.21× | 1.17× |
| MDA only (no BPO) | 1.18× | 1.15× |
| BPO + MDA | 1.45× | 1.38× |
| CPU path only | 1.31× | 1.28× |
| GPU-direct only | 1.28× | 1.24× |
| Full ForeMoE | 1.45× | 1.38× |
The BPO and MDA are complementary: BPO reduces the step-level imbalance that MDA then fine-tunes, while MDA alone without a good base placement wastes its adjustment budget on large corrections. Similarly, the two transfer paths are complementary: CPU-assisted excels during recompute (parallel with compute), GPU-direct excels during policy update (lower latency).
6.5 Planner Latency Analysis
BPO execution time: ~50ms (runs once per step, ~1000ms step time → 5% overhead)
MDA execution time: ~5ms (runs per micro-step, ~100ms compute → 5% overhead)
Expert transfer time (CPU-assisted, 12.5 MB expert, PCIe): ~0.4ms
Expert transfer time (GPU-direct, 12.5 MB expert, NVLink): ~0.02ms
Effective overhead: ~0% (fully overlapped with computation in common case)
Worst case (many experts need transfer, small micro-step): ~8% overhead
The planner is implemented in C++ with CUDA streams for asynchronous GPU operations. The BPO runs on CPU in a background thread, completing before the next training step begins.
7. Related Work
Pre-training MoE load balancing: FasterMoE (He et al., 2022) introduces expert shadowing (asynchronous replication); LAER-MoE (Zhai et al., 2023) uses latency-aware expert replication; FlexMoE (Nie et al., 2023) applies flexible expert placement; EPLB (Skiadopoulos et al., 2026) uses EMA-based load prediction with bin-packing-style assignment. All of these operate at step granularity using historical statistics — they are not designed for micro-step imbalance.
RL post-training systems: VERL (Zheng et al., 2025) and AceMo (Ma et al., 2025) implement router replay for correctness but do not exploit it for load balancing. OpenRLHF (Hu et al., 2024) provides a general RL training framework without MoE-specific optimizations. ForeMoE is designed as an add-on module for these systems.
Expert parallelism communication: GShard (Lepikhin et al., 2021), Switch Transformer (Fedus et al., 2022), and DeepSpeed-MoE (Rajbhandari et al., 2022) study the communication patterns of MoE training but focus on correctness and general efficiency, not the specific RL post-training dynamics.
System-level position of ForeMoE: it is the first work to (a) characterize the micro-step-level load imbalance in MoE RL post-training, (b) formally prove the load balancing problem is NP-hard, and (c) propose a tractable hierarchical algorithm exploiting router replay as a foresight signal.
8. Boundary Conditions
When ForeMoE helps most:
- Small micro-step batch size (high per-expert variance)
- Concentrated RL task domains (strong expert specialization → skewed step-level mean)
- Many experts and ranks (large reconfiguration flexibility)
- Memory headroom for replication
- Fast interconnect (NVLink for GPU-direct, PCIe 4.0+ for CPU-assisted)
When ForeMoE helps less:
- Large micro-step batch size (statistical averaging reduces variance)
- Diverse RL task domains (routing roughly uniform → mild imbalance)
- Very tight GPU memory (limits replication budget)
- No router replay (some RL frameworks do not enforce routing consistency between stages)
Hard requirements:
- The RL training framework must implement router replay
- The system must record the full routing table during rollout (memory cost: bytes for top-2 routing = ~200 MB for a 30B model with 512-token responses)
9. Critical Assessment: Weaknesses & Improvements
9.1 Weaknesses and Flaws
W1 — Single model scale only. All experiments use Qwen3-30B-A3B. MoE dynamics change substantially at larger scale: DeepSeek-V3 uses 256 experts per MoE layer (vs 64), a higher EP degree, and 671B total parameters. Whether the 1.45× speedup, planner overhead, and transfer engine’s overlap strategy hold for these larger configurations is completely unknown. The micro-step batch size relative to is different at scale, potentially changing the variance characteristics.
W2 — No training quality metrics reported. The paper reports only wall-clock speedup. It never reports reward curves, downstream task accuracy (e.g., MATH500, HumanEval, AIME benchmarks), or convergence speed in samples. Expert relocations and replications could in principle introduce subtle numerical differences in gradient computation. Even if the router replay constraint is satisfied, floating-point rounding during weight transfer might cause micro-differences. The paper should demonstrate that ForeMoE converges to the same policy quality as baseline training.
W3 — The 5% oracle gap is unexplained. ForeMoE achieves 95% of oracle (1.45× vs 1.52×). The paper does not analyze what causes the remaining gap. Possible explanations include: (a) MDA sub-optimality (greedy vs optimal), (b) memory-constrained replication limiting the best achievable balance, (c) residual transfer latency not fully overlapped, (d) per-layer vs aggregate load balancing mismatch. Understanding this gap would guide future improvements.
W4 — Memory overhead of replication unreported. Expert replication increases GPU memory footprint. For Qwen3-30B-A3B with 32 MoE layers and ~12.5 MB per expert, replicating even 2 experts per rank adds ~25 MB per rank — small in absolute terms but non-trivial when training memory is 90%+ full. The paper does not report the replication overhead or provide guidance on setting the replication budget.
W5 — Scalability of BPO warm-starting. The BPO uses the previous step’s placement as a warm start. If the routing distribution shifts (e.g., the model is rapidly learning and changing its expert preferences), the warm start degrades from an accelerator to a liability. For early RL training when the policy is changing rapidly, the warm start may be stale more often than beneficial. No analysis of warm-start quality over training time is provided.
9.2 Limitations the Authors Understate
L1 — Router replay scope. The paper presents router replay as an inherent property of RL post-training. However, it is a property of specific RL implementations (VERL, AceMo) that choose to preserve routing decisions for importance sampling correction. Some RL variants — including some implementations of PPO with a separate reference model — do not require routing replay and therefore do not expose ForeMoE’s key opportunity. The paper does not clearly scope this.
L2 — Only concentrated-domain datasets tested. DAPO-Math-17k and CodeForces are both “concentrated domain” RL tasks where expert specialization is high and step-level load is skewed. The paper’s motivation explicitly says these conditions cause micro-step imbalance. But for general-purpose RL post-training (e.g., aligning a model to diverse conversational tasks via RLHF), the routing distribution may be much more uniform, reducing both the problem severity and ForeMoE’s benefit. No diverse-domain experiment is included.
L3 — No multi-node results. The 64 GPUs appear to be intra-node or single-pod (NVLink-connected). In real production training (hundreds of nodes), GPU-direct NVLink is unavailable for inter-node transfers, and CPU-assisted PCIe must handle all cross-node migrations. PCIe’s ~32 GB/s bandwidth may be insufficient to overlap expert transfers within the inter-node micro-step compute window. The system’s behavior in this more common deployment scenario is not characterized.
9.3 Concrete Improvement Suggestions
I1 — Multi-scale evaluation. Run experiments on at least one 70B+ MoE model (e.g., DeepSeek-V2-Lite or a Mixtral-8x22B-class model) to characterize how speedup, planner overhead, and load imbalance scale with model size.
I2 — Training quality validation. Report reward curves and benchmark accuracy (MATH500, HumanEval) for models trained with and without ForeMoE, confirming that speedup does not come at the cost of policy quality.
I3 — Multi-node extension. Implement and evaluate an inter-node transfer path using RDMA (InfiniBand or RoCE), measuring transfer latency and whether overlap remains feasible in multi-node training.
I4 — Diverse-domain RL experiment. Include at least one experiment with a diverse task domain (e.g., a general instruction-following dataset or a mixed math+code dataset) to characterize ForeMoE’s benefit under more uniform routing distributions.
I5 — Memory-accuracy tradeoff study. Provide a Pareto frontier plot of speedup vs memory overhead as the replication budget varies, giving practitioners guidance on deployment configuration.
10. Reproducibility
- No public code repository linked. The system is described as “implemented over Megatron-LM/FSDP with VERL-style router replay” but no specific integration is provided.
- Key hyperparameters (deviation threshold in MDA, replication budget , BPO warm-start step count, EMA decay for BPO initialization) are mentioned in the text but not consolidated in a reproducibility table.
- The expert weight sizes and transfer bandwidth assumptions are specific to the H100 + NVLink 4 configuration. Users on different hardware (A100 + NVLink 3, or multi-node InfiniBand) would need to re-characterize transfer latencies.
- The routing table memory overhead (~200 MB for the tested configuration) scales with response length and model depth; long-context RL training (e.g., 8K token responses) would increase this by 16×.
11. Conclusion
ForeMoE makes a clean, well-motivated contribution: it identifies that MoE RL post-training has a fundamentally different load imbalance profile from pre-training (micro-step-level, not step-level), explains why existing methods fail, and proposes a practical system that exploits the structural property of router replay to convert an NP-hard online planning problem into a tractable offline planning problem.
The 1.45× end-to-end speedup is substantial and practically meaningful — equivalent to reducing a 4-hour training run by nearly 45 minutes per episode. The hierarchical Four-Stage Planner is algorithmically principled, the Expert Transfer Engine’s dual-path design is hardware-aware, and the NP-hardness proof grounds the algorithmic choices.
The main gaps are scope (single model, two concentrated-domain datasets, intra-node only, no training quality metrics) and transparency (no code release, underspecified hyperparameters). As MoE RL post-training becomes the norm for frontier model development — DeepSeek-R1, Kimi K2.5, and likely future GRPO-trained MoE models — ForeMoE’s approach to micro-step load balancing will become increasingly important. The ideas here should generalize to multi-node settings and larger model scales, though that work remains to be done.
Appendix A: Key Symbols Reference
| Symbol | Meaning |
|---|---|
| Total number of experts per MoE layer | |
| Number of ranks (GPUs) with expert parallelism | |
| Number of experts selected per token (top- routing) | |
| Number of micro-steps per training step | |
| Total batch size per step (tokens) | |
| Batch size per micro-step | |
| Token count assigned to expert in micro-step (from routing table ) | |
| Step-level mean token count for expert : | |
| Micro-step deviation: | |
| Load on rank during micro-step | |
| Ideal (balanced) load per rank: | |
| Imbalance ratio for micro-step : | |
| Long-run routing probability for expert | |
| Total routing mass assigned to rank | |
| Relocation variable: expert is primary on rank | |
| Replication variable: expert has a copy on rank | |
| Memory cost of expert | |
| Per-rank GPU memory budget | |
| Routing table from rollout stage | |
| Base placement from BPO | |
| Micro-step adjustment plan from MDA for micro-step |
Appendix B: Comparison with Pre-training Load Balancing Methods
To make the positioning of ForeMoE concrete, here is a structured comparison with key prior methods:
| Aspect | FasterMoE | LAER-MoE | EPLB | ForeMoE |
|---|---|---|---|---|
| Target training phase | Pre-training | Pre-training | Pre-training | RL Post-training |
| Planning signal | Historical EMA | Historical EMA | Historical EMA | Router replay (exact) |
| Granularity | Step-level | Step-level | Step-level | Micro-step-level |
| Algorithm | Expert shadowing | Latency-aware | Bin packing | BPO + MDA (hierarchical) |
| Hardware paths | PCIe only | PCIe only | PCIe | PCIe + NVLink (stage-specific) |
| NP-hardness addressed | No | No | Approximation | Yes (hierarchical decomp) |
| Proven speedup context | Pre-training | Pre-training | Pre-training | RL post-training |
The fundamental difference is the planning signal. Pre-training methods must estimate the unknown future load from historical statistics; ForeMoE reads the exact future load from the routing table that rollout has already computed.
Appendix C: ForeMoE Integration with GRPO Training Loop
To make the system concrete, here is a pseudocode-level integration of ForeMoE into a GRPO training step:
Algorithm 3: ForeMoE-augmented GRPO Training Step
─────────────────────────────────────────────────────────────────────
Input: prompts X = {x_1, ..., x_N}; current policy θ; previous placement P_prev
Output: updated policy θ'; updated placement P_next
// Stage 1: Rollout
(Y, log_probs_old, R) = vLLM_rollout(X, θ)
// Y = generated responses; R = routing table for all tokens/layers
// Compute rewards
rewards = reward_model(X, Y)
advantages = GRPO_advantage(rewards, log_probs_old)
// ForeMoE: Base Placement (once per step, async on CPU)
mean_loads = aggregate_step_loads(R) // {c̄_e} from routing table
P* = BPO(mean_loads, P_prev, memory_budget=C)
// Stage 2: Recompute (M micro-steps)
log_probs = []
For m = 1 to M:
R_m = get_micro_step_routing(R, m) // exact loads for this micro-step
A_m = MDA(P*, R_m, eps=0.1) // micro-step adjustment (async)
ExpertTransferEngine.execute_cpu_path(A_m) // CPU-assisted PCIe transfer
log_probs_m = megatron_forward(Y_m, routing=R_m, placement=P*)
log_probs.append(log_probs_m)
// Stage 3: Policy Update (M micro-steps)
For m = 1 to M:
R_m = get_micro_step_routing(R, m)
A_m = MDA(P*, R_m, eps=0.1)
ExpertTransferEngine.execute_gpu_direct_path(A_m) // NVLink transfer
loss_m = GRPO_loss(log_probs[m], log_probs_old_m, advantages_m, routing=R_m)
loss_m.backward()
// Apply gradient update
optimizer.step(θ)
θ' = θ
P_next = P* // base placement becomes warm start for next step
Return θ', P_next
─────────────────────────────────────────────────────────────────────
The critical property: routing=R_m is passed to both megatron_forward and GRPO_loss, enforcing router replay. Expert placement changes in affect only which physical GPU processes each expert, not which expert each token visits — so router replay is preserved exactly.
Further Reading
- DeepSeek-R1 (arXiv 2501.12948): largest-scale RL post-training of a MoE model; direct target use case for ForeMoE
- GRPO (Shao et al., 2024): Group Relative Policy Optimization; the RL algorithm used in ForeMoE’s experiments
- DAPO (arXiv 2503.14476): open-source large-scale GRPO system; builds on the RL pipeline ForeMoE targets
- VERL (Zheng et al., 2025): RL training framework implementing router replay; natural integration target
- Switch Transformer (arXiv 2101.03961): seminal large-scale MoE paper; background on expert parallelism and load balancing
- MegaScale (arXiv 2402.15627): production-scale LLM training infrastructure; systems context for deploying ForeMoE
- REINFORCE++ (arXiv 2501.03262): critic-free RL with global advantage normalization; another GRPO variant that ForeMoE’s load balancing would benefit
- KeepKV (arXiv 2504.09936): KV cache compression for inference; orthogonal memory optimization for the rollout stage
- Megatron-LM (Shoeybi et al., 2020): distributed training framework with expert parallelism support; the training backbone ForeMoE integrates with
- vLLM (arXiv 2309.06180): PagedAttention-based high-throughput inference engine; the rollout engine that generates the routing table ForeMoE relies on
- DisagMoE (arXiv 2605.11005): disaggregated MoE training with communication-computation overlap; complementary approach targeting pre-training scenarios
- SGLang (arXiv 2312.07104): structured LLM program execution with RadixAttention; alternative rollout engine compatible with router replay
- GQA (arXiv 2305.13245): grouped-query attention used in Qwen3 and many modern MoE LLMs; reduces KV cache pressure during the rollout stage
- VAPO (arXiv 2504.05118): value-augmented PPO for long-chain-of-thought reasoning; long responses increase the routing table size and amplify micro-step imbalance, making ForeMoE more valuable