Review date: 2026-06-25
Review author: Zhongzhu Zhou
Paper reviewed: ReMP: Low-Downtime Runtime Model-Parallelism Reconfiguration for LLM Serving
Paper authors: Haipeng Yuan, Kaining Zheng, Yongshu Bai, Yuchen Zhang, Yunquan Zhang, Baodong Wu, Xiang Gao, Daning Cheng
arXiv: 2606.18741
Status/Venue: arXiv preprint, June 2026 (ICT/Chinese Academy of Sciences, Zhejiang Lab, Infinigence AI)
Short Answer
Every production LLM serving system today hard-wires its Tensor Parallelism (TP) and Pipeline Parallelism (PP) configuration at launch time. Changing from TP4PP2 to TP2PP4 requires restarting the service — which takes several minutes and loses the entire KV cache. ReMP breaks this constraint. It decouples four normally static runtime states — model weights, KV cache, communication groups, and worker lifetimes — from the launch-time topology, then re-establishes each under the new configuration without tearing down the process. The result: topology switches complete in 1–7 seconds on models ranging from 7B to 70B parameters, 10–100× faster than restart, while preserving previously computed KV cache states to avoid expensive prefill recomputation.
Prerequisites: What You Need to Know First
ReMP is a systems paper that sits at the intersection of distributed model parallelism and LLM serving. To understand why it is hard and why the techniques work, you need to understand five building blocks.
1. Autoregressive Decoding and the KV Cache
Transformer-based language models generate tokens one at a time. At decode step $t$, the model receives all past tokens and produces one new token. The core computation is attention over the full history. Without any optimization, each step recomputes the key and value projections for all past tokens and reruns attention over the whole prefix, which costs $O(n^2)$, where $n$ is the sequence length.
The KV cache eliminates this waste: after a token’s key and value are computed in a given layer, they are stored in GPU memory and reused at every future step. This reduces per-step attention from $O(n^2)$ to $O(n)$ but introduces a memory cost:
$$M_{KV} = 2 \cdot L \cdot H_{kv} \cdot d_h \cdot n \cdot B \cdot \text{bytes per element}$$
where the factor 2 covers keys and values, $L$ is the number of transformer layers, $H_{kv}$ is the number of KV heads per layer, $d_h$ is the head dimension, $n$ is the context length, and $B$ is the batch size. For Llama2-70B (80 layers, 8 KV heads, $d_h = 128$) with a 4K context, batch 32, BF16:
$$M_{KV} = 2 \cdot 80 \cdot 8 \cdot 128 \cdot 4096 \cdot 32 \cdot 2 \text{ bytes} \approx 43 \text{ GB (40 GiB)}$$
The KV cache layout is inseparable from the model parallelism topology — it determines which GPU stores which layers and which attention heads. This coupling is the root cause of ReMP’s problem.
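The KV cache size formula can be checked numerically. The following is a minimal sketch; the function name `kv_cache_bytes` is illustrative, and the head dimension of 128 is the usual Llama2-70B value (hidden size 8192 over 64 attention heads).

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   batch_size, bytes_per_elem=2):
    """Total KV cache size in bytes.

    The leading factor 2 accounts for storing both keys and values for
    every layer, KV head, token position, and request in the batch.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * batch_size * bytes_per_elem)


# Llama2-70B-style shape: 80 layers, 8 KV heads, head dim 128,
# 4K context, batch 32, BF16 (2 bytes per element).
total = kv_cache_bytes(80, 8, 128, 4096, 32)
print(total, total / 2**30)  # 42949672960 40.0
```

The result is exactly 40 GiB, or about 43 GB in decimal units, which is already half of an 80 GB H100.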
2. Tensor Parallelism (TP)
Tensor parallelism (Megatron-LM, 2021) splits individual weight matrices across multiple GPUs along a dimension so that each GPU computes a slice of every layer’s output.
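For the attention projection $W_Q$ (and likewise $W_K$ and $W_V$), TP splits the matrix across the head dimension:
$$W_Q = [\,W_Q^{(0)},\ W_Q^{(1)},\ \ldots,\ W_Q^{(\mathrm{TP}-1)}\,]$$
GPU $i$ holds slice $W_Q^{(i)}$ and computes $Q^{(i)} = X W_Q^{(i)}$, where $X$ is the input activation and $\mathrm{TP}$ is the TP degree. After each attention sub-computation, an AllReduce across the $\mathrm{TP}$ GPUs collects the full output.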
The key trade-off:
- Higher TP degree → smaller per-GPU memory, faster single-request latency (more GPUs share computation), but more frequent AllReduce communication overhead.
- Lower TP degree → less communication overhead, higher single-GPU utilization at scale.
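In KV cache terms, TP rank $r$ on a given pipeline stage owns KV heads $[\,r \cdot H_{kv}/\mathrm{TP},\ (r+1) \cdot H_{kv}/\mathrm{TP}\,)$ for every layer on that stage, where $H_{kv}$ is the number of KV heads per layer and $\mathrm{TP}$ is the TP degree.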
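To make the split-then-AllReduce pattern concrete, here is a small pure-Python sketch of a column-parallel projection followed by a row-parallel projection, the Megatron-style pairing. The helper names and the toy matrices are illustrative only, and `all_reduce_sum` is a local stand-in for a real collective.

```python
def matmul(a, b):
    """Plain-Python matrix product for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def split_columns(w, tp):
    """Column-parallel split: rank i gets an equal slice of output columns."""
    width = len(w[0]) // tp
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(tp)]


def split_rows(w, tp):
    """Row-parallel split: rank i gets an equal slice of input rows."""
    height = len(w) // tp
    return [w[i * height:(i + 1) * height] for i in range(tp)]


def all_reduce_sum(partials):
    """Stand-in for AllReduce: element-wise sum of every rank's partial output."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]                                # 2 tokens, hidden size 4
w1 = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 0], [1, 1, 1, 1]]   # split by columns
w2 = [[1, 2], [0, 1], [2, 0], [1, 1]]                           # split by rows
tp = 2

reference = matmul(matmul(x, w1), w2)
# Each rank multiplies by its own slices; no communication until the final sum.
partials = [matmul(matmul(x, w1_i), w2_i)
            for w1_i, w2_i in zip(split_columns(w1, tp), split_rows(w2, tp))]
assert all_reduce_sum(partials) == reference
```

Each rank works only on its own slices, and a single sum at the end recovers the unsharded result. That final sum is the AllReduce whose cost grows with the TP degree.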
3. Pipeline Parallelism (PP)
Pipeline parallelism (GPipe, PipeDream) partitions the transformer’s layers across pipeline stages. With $L$ layers and PP degree $\mathrm{PP}$, stage $s$ owns layers:
$$\{\ell : s \cdot L/\mathrm{PP} \le \ell < (s+1) \cdot L/\mathrm{PP}\}$$
During inference, a request flows through stages 0 → 1 → … → $\mathrm{PP}-1$. Multiple micro-batches can overlap in the pipeline to improve GPU utilization (the “pipeline bubble” trade-off).
The KV cache for a given request is partitioned across stages: stage $s$ stores the KV blocks for its own layers. When TP changes the number of active workers per stage, or PP changes layer ownership between stages, KV cache blocks stored on old GPUs must be physically moved to the correct new GPUs.
4. vLLM and PagedAttention
vLLM (Kwon et al., 2023) is the dominant open-source LLM serving engine. Its key contribution, PagedAttention, manages the KV cache as a pool of fixed-size blocks (like OS virtual memory pages), eliminating fragmentation and enabling dynamic block allocation. Each block holds the KV state for a contiguous chunk of tokens.
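The paging idea can be illustrated with a toy allocator. This is a sketch of the concept only, not vLLM's actual block manager; the class `BlockPool` and its methods are hypothetical names.

```python
class BlockPool:
    """Toy paged KV allocator: fixed-size blocks handed out from a free list."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}    # request id -> list of physical block ids
        self.lengths = {}   # request id -> number of tokens stored

    def append_token(self, req_id):
        """Reserve a slot for one more token.

        A new physical block is taken only when the request's last block is
        full, so at most one partially filled block is wasted per request.
        """
        table = self.tables.setdefault(req_id, [])
        n = self.lengths.get(req_id, 0)
        if n == len(table) * self.block_size:
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop(0))
        self.lengths[req_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def release(self, req_id):
        """Return all of a finished request's blocks to the free list."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


pool = BlockPool(num_blocks=4, block_size=2)
slots_a = [pool.append_token("a") for _ in range(3)]   # (block, offset) pairs
slot_b = pool.append_token("b")
pool.release("a")
```

The block table is the piece of metadata that a topology switch has to rewrite, because the physical block ids are only meaningful on the GPU that owns them.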
ReMP is implemented on top of vLLM v1. The integration touches vLLM’s executor, worker management, KV cache manager, and communication initialization pipeline.
5. The Static Topology Problem
When vLLM (or any serving engine) launches with TP2PP4, the following are all initialized with respect to that specific configuration:
- Weight tensors: each GPU loads its fixed shard.
- Communication groups: NCCL groups are created for TP AllReduce and PP peer-to-peer.
- KV cache blocks: allocated and indexed with a specific layer-to-stage and head-to-TP-rank mapping.
- Worker processes: exactly TP × PP = 8 workers, each bound to one (pp_rank, tp_rank).
There is no mechanism to change any of these without shutting down and restarting from scratch. In production, restart takes 2–10 minutes (checkpoint reload, CUDA initialization, NCCL group setup, memory allocation). All accumulated KV cache is lost, forcing every in-flight request to redo its prefill computation.
Figure 1: Static topology coupling vs. ReMP’s decoupled design
graph TD
subgraph Static["Static vLLM (Current)"]
WT["Weight Shards<br/>(bound to topology)"] --> T["TP/PP Topology<br/>(fixed at launch)"]
KV["KV Cache Layout<br/>(bound to topology)"] --> T
CG["NCCL Comm Groups<br/>(bound to topology)"] --> T
WK["Worker Processes<br/>(bound to topology)"] --> T
end
subgraph ReMP["ReMP (Decoupled)"]
SWS["Shared Weight Store<br/>(CPU shared memory)"] --> RC["Reconfiguration Controller"]
MSS["MPU State Space<br/>(multi-topology snapshots)"] --> RC
KME["KV Migration Engine<br/>(2D remap)"] --> RC
WLM["Worker Lifecycle Mgr<br/>(standby/wakeup)"] --> RC
SA["Scheduler Adapter<br/>(rebinds after switch)"] --> RC
end
T -. "restart required (minutes)" .-> T
RC -. "online switch (seconds)" .-> RC
Problem Statement: Dynamic Workloads vs Static Topology
ReMP’s motivation is rooted in a real production observation. Real-world LLM serving traffic is not constant — it follows diurnal patterns. The paper shows traffic data from two production systems (INFINIGENCE AI and Zhejiang Lab) where traffic at 20:00 is roughly 3–5× higher than at 04:00.
The critical insight: the optimal TP/PP configuration is traffic-dependent.
- Low traffic (few concurrent requests): Latency is the bottleneck. Minimize per-request processing time by using high TP degree — more GPUs share each request’s computation, reducing TTFT (Time To First Token).
- High traffic (many concurrent requests): Throughput is the bottleneck. Use high PP degree: pipeline stages process different requests in parallel, maximizing tokens/second, whereas under high TP the per-step AllReduce overhead becomes the limiting factor.
A fixed TP4PP2 configuration, optimized for low traffic, degrades throughput at peak. A fixed TP1PP8 configuration, optimized for peak, has unacceptable latency during off-peak. Because restart costs minutes, production systems are forced to pick one configuration and live with suboptimal performance for part of the day.
Figure 2: TP vs. PP trade-off under different traffic regimes
graph LR
subgraph LowLoad["Low Traffic"]
direction TB
L1["High TP degree<br/>(e.g., TP4PP2)"]
L2["More GPUs per request<br/>→ faster single-request latency"]
L3["TTFT ↓, TPOT ↓"]
L1 --> L2 --> L3
end
subgraph HighLoad["High Traffic"]
direction TB
H1["High PP degree<br/>(e.g., TP1PP8)"]
H2["More pipeline stages<br/>→ higher concurrency"]
H3["Throughput ↑, fewer AllReduce"]
H1 --> H2 --> H3
end
LowLoad -. "traffic spike" .-> HighLoad
HighLoad -. "traffic drops" .-> LowLoad
style LowLoad fill:#e8f4e8
style HighLoad fill:#fde8e8
Core Contribution
ReMP makes TP/PP topology a runtime-adjustable resource rather than a launch-time parameter. It achieves this through three coupled innovations:
- State decoupling: Break the hard binding between topology and the four runtime states (weights, KV cache, communication groups, workers).
- Two-dimensional KV cache migration: Semantically remap and physically transfer KV blocks along both the PP (layer) and TP (KV head) dimensions simultaneously.
- End-to-end integration in vLLM v1: A complete implementation covering executor, workers, cache manager, and comm initialization.
System Design
Architecture: Six Components
ReMP augments vLLM’s serving engine with six new modules:
Figure 3: ReMP architecture and component interactions
graph TB
RC["Reconfiguration Controller<br/>(entry point, transaction coord)"]
SWS["Shared Weight Store<br/>(full model in CPU shared mem)"]
MSS["MPU State Space<br/>(pre-built topology snapshots)"]
WLM["Worker Lifecycle Manager<br/>(active / standby / wakeup)"]
KME["KV Migration Engine<br/>(2D remap + P2P transfer)"]
SA["Scheduler Adapter<br/>(block mgr rebind + preempt)"]
RC --> SWS
RC --> MSS
RC --> WLM
RC --> KME
RC --> SA
SWS -. "CPU SharedStateDict<br/>reloads target GPU shards" .-> WLM
MSS -. "apply target topology<br/>NCCL groups + rank maps" .-> WLM
KME -. "layer-by-layer P2P<br/>batched transfers" .-> WLM
SA -. "rebuild block table<br/>preempt if needed" .-> RC
Reconfiguration Controller: The transaction entry point. Receives the target (TP, PP) pair, orchestrates all other components, and ensures that model shards, KV cache, communication state, and scheduler metadata are mutually consistent before resuming serving.
Shared Weight Store: On startup, the full model checkpoint is loaded into CPU shared memory as a SharedStateDict. When a topology switch occurs, each worker reads its target shard directly from this shared store rather than re-loading checkpoint files from disk. This eliminates the dominant latency of restart: disk I/O and checkpoint deserialization.
MPU (Model Parallel Unit) State Space: For each candidate topology (e.g., TP1PP8, TP2PP4, TP4PP2, TP8PP1), a parallel-state snapshot is pre-built at startup. Each snapshot contains:
- TP group membership and inter-GPU process groups
- PP peer ranks (prev/next stage)
- Rank-to-device mapping
- Per-topology NCCL communicator handles
Switching means selecting the correct snapshot and applying it to vLLM’s global parallel_state, an $O(1)$ operation rather than a full NCCL group rebuild.
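A minimal sketch of the snapshot idea follows, assuming the rank layout `g = stage × TP + tp_rank` used in the migration formulas of this review. The names `ParallelSnapshot`, `STATE_SPACE`, and `apply_topology` are hypothetical. A real implementation would store communicator handles in each snapshot, while this sketch stores only rank memberships.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelSnapshot:
    tp: int
    pp: int
    tp_groups: tuple   # ranks that AllReduce together, one tuple per stage
    pp_chains: tuple   # ranks forming one stage-to-stage chain, one per TP rank


def build_snapshot(tp, pp):
    """Derive group membership from the layout g = stage * tp + tp_rank."""
    tp_groups = tuple(tuple(s * tp + r for r in range(tp)) for s in range(pp))
    pp_chains = tuple(tuple(s * tp + r for s in range(pp)) for r in range(tp))
    return ParallelSnapshot(tp, pp, tp_groups, pp_chains)


# Built once at startup; in a real system this is where the slow group
# creation would happen for every candidate topology.
CANDIDATES = [(1, 8), (2, 4), (4, 2), (8, 1)]
STATE_SPACE = {cand: build_snapshot(*cand) for cand in CANDIDATES}


def apply_topology(tp, pp):
    """Switching is a dictionary lookup; nothing is constructed here."""
    return STATE_SPACE[(tp, pp)]


active = apply_topology(2, 4)
print(active.tp_groups)  # ((0, 1), (2, 3), (4, 5), (6, 7))
print(active.pp_chains)  # ((0, 2, 4, 6), (1, 3, 5, 7))
```

The trade-off is that every candidate must be known at startup, since anything outside `STATE_SPACE` would require building groups on the spot.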
Worker Lifecycle Manager: Manages three worker states — active, standby, and waking_up. When the new topology needs fewer GPUs (e.g., TP8PP1 → TP4PP1), extra workers enter standby after their KV state is migrated away. When more GPUs are needed, standby workers wake up and synchronize the message-queue ring index so they can receive control messages immediately.
KV Migration Engine: The most technically complex component. It computes the 2D remapping plan and executes batched P2P GPU-to-GPU transfers.
Scheduler Adapter: After a topology switch, the block manager’s KV block table must reflect the new memory layout. The adapter:
- Regenerates the KV cache configuration (block size, num_blocks per GPU, layer-to-stage mapping).
- Expands or shrinks per-GPU block pools.
- Preempts running requests whose KV blocks now span workers that no longer exist.
- Refreshes the pipeline-parallel batch queue.
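The preemption rule in the list above can be sketched as a set-containment check. The data structure is an assumption made for illustration: `block_owners` maps each request to the set of global ranks that hold its KV blocks, which is not necessarily how the block manager tracks ownership.

```python
def requests_to_preempt(block_owners, active_ranks):
    """Return request ids with at least one KV block on a rank outside the active set.

    block_owners: dict mapping request id -> set of global ranks holding its blocks.
    active_ranks: iterable of ranks that remain active after the switch.
    """
    active = set(active_ranks)
    return sorted(req for req, owners in block_owners.items()
                  if not owners <= active)


owners = {"req-a": {0, 1}, "req-b": {2, 5}, "req-c": {3}}
print(requests_to_preempt(owners, active_ranks=range(4)))  # ['req-b']
```

Only `req-b` is preempted because rank 5 is no longer active. A preempted request has to redo its prefill, which is why the preemption rate matters for the cost of a switch.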
Two-Dimensional KV Cache Migration
The KV cache layout is governed by two independent dimensions that both change when topology changes:
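- Pipeline Parallelism (layer dimension): Stage $s$ in a $\mathrm{PP}$-stage pipeline owns layers $[\,s \cdot L/\mathrm{PP},\ (s+1) \cdot L/\mathrm{PP}\,)$. When $\mathrm{PP}$ changes, every layer potentially moves to a new stage on a new GPU.
- Tensor Parallelism (head dimension): TP rank $r$ in a $\mathrm{TP}$-rank tensor-parallel group owns KV heads $[\,r \cdot H_{kv}/\mathrm{TP},\ (r+1) \cdot H_{kv}/\mathrm{TP}\,)$ of every layer on its stage. When $\mathrm{TP}$ changes, the head slices are redistributed.
The migration plan for a single KV block for layer $\ell$, KV head $h$, under topology change $(\mathrm{TP}_{old}, \mathrm{PP}_{old}) \to (\mathrm{TP}_{new}, \mathrm{PP}_{new})$:
- Source stage in old topology: $s_{old} = \lfloor \ell / (L / \mathrm{PP}_{old}) \rfloor$
- Source TP rank in old topology: $r_{old} = \lfloor h / (H_{kv} / \mathrm{TP}_{old}) \rfloor$
- Source GPU (global rank): $g_{old} = s_{old} \cdot \mathrm{TP}_{old} + r_{old}$
- Destination stage in new topology: $s_{new} = \lfloor \ell / (L / \mathrm{PP}_{new}) \rfloor$
- Destination TP rank in new topology: $r_{new} = \lfloor h / (H_{kv} / \mathrm{TP}_{new}) \rfloor$
- Destination GPU: $g_{new} = s_{new} \cdot \mathrm{TP}_{new} + r_{new}$
The migration is executed layer by layer: after all KV blocks for layer $\ell$ are transferred via batched P2P operations, the old GPU’s storage for that layer is freed immediately, which avoids holding both old and new layouts in memory simultaneously. This streaming approach keeps peak memory overhead at $O(M_{KV}/L)$, one layer’s share of the total KV cache size $M_{KV}$, rather than $O(M_{KV})$.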
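The source and destination mapping can be written out directly from the floor-division formulas used in Algorithm 2. The sketch below assumes the layer count divides evenly by PP and the KV head count divides evenly by TP; the function names are illustrative.

```python
def owner_gpu(layer, head, num_layers, num_kv_heads, tp, pp):
    """Global rank that owns (layer, head) under a (tp, pp) topology.

    Assumes num_layers % pp == 0 and num_kv_heads % tp == 0.
    """
    stage = layer // (num_layers // pp)
    tp_rank = head // (num_kv_heads // tp)
    return stage * tp + tp_rank


def migration_plan(num_layers, num_kv_heads, old, new):
    """Map each (layer, head) pair to its (source GPU, destination GPU).

    old and new are (tp, pp) pairs.
    """
    plan = {}
    for layer in range(num_layers):
        for head in range(num_kv_heads):
            src = owner_gpu(layer, head, num_layers, num_kv_heads, *old)
            dst = owner_gpu(layer, head, num_layers, num_kv_heads, *new)
            plan[(layer, head)] = (src, dst)
    return plan


# TP2PP2 -> TP1PP4 on a toy model with 8 layers and 2 KV heads.
plan = migration_plan(8, 2, old=(2, 2), new=(1, 4))
edges = sorted(set(plan.values()))
print(edges)
# [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (2, 3), (3, 2), (3, 3)]
```

The eight distinct (source, destination) pairs match the eight arrows in the TP2PP2 → TP1PP4 diagram. Pairs with identical source and destination, such as (0, 0), are local copies rather than cross-GPU transfers.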
Figure 4: Two-dimensional KV cache migration under TP2PP2 → TP1PP4 transition
graph LR
subgraph OLD [Old Topology TP2PP2]
G0o[GPU0 Stage0 TP0]
G1o[GPU1 Stage0 TP1]
G2o[GPU2 Stage1 TP0]
G3o[GPU3 Stage1 TP1]
end
subgraph NEW [New Topology TP1PP4]
G0n[GPU0 Stage0]
G1n[GPU1 Stage1]
G2n[GPU2 Stage2]
G3n[GPU3 Stage3]
end
G0o -->|P2P| G0n
G1o -->|P2P| G0n
G0o -->|P2P| G1n
G1o -->|P2P| G1n
G2o -->|P2P| G2n
G3o -->|P2P| G2n
G2o -->|P2P| G3n
G3o -->|P2P| G3n
Reconfiguration Transaction
Each topology switch is wrapped in a controlled reconfiguration transaction. The transaction is a strictly ordered sequence of states:
SERVING (T_old)
→ QUIESCING (drain in-flight requests, freeze scheduler)
→ PREPARING_WORKERS (activate/standby/wakeup workers)
→ APPLYING_MPU (load target topology snapshot)
→ [MIGRATE_KV ‖ RELOAD_SHARDS] (parallel)
→ REBINDING (scheduler adapter, block manager)
→ SERVING (T_new)
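The critical path optimization is overlapping model-shard reloading with KV cache migration. These two operations touch disjoint memory regions (model parameters vs. KV cache blocks), so they can proceed in parallel. Writing $T_{kv}$ for the KV migration time and $T_{model}$ for the shard reload time, this phase costs:
$$T_{sequential} = T_{kv} + T_{model}$$
$$T_{overlap} = \max(T_{kv}, T_{model})$$
The saving is $T_{sequential} - T_{overlap} = \min(T_{kv}, T_{model})$, which is significant for large models (where $T_{model}$ dominates) and for cache-heavy workloads (where $T_{kv}$ dominates).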
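The overlap of the two phases can be sketched with two concurrent tasks and a join. The bodies of `migrate_kv` and `reload_shards` are placeholders, since the real work happens on GPU streams and PCIe copies, and all names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def migrate_kv(log):
    log.append("kv")       # placeholder for layer-by-layer P2P transfers
    return "kv_done"


def reload_shards(log):
    log.append("shards")   # placeholder for reading target shards from host memory
    return "shards_done"


def migrate_and_reload():
    """Run both phases concurrently and return only when both have finished."""
    log = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        kv_future = pool.submit(migrate_kv, log)
        shard_future = pool.submit(reload_shards, log)
        # Calling result() on both futures is the barrier before rebinding.
        results = (kv_future.result(), shard_future.result())
    return results, log


def phase_time(t_kv, t_model, overlap=True):
    """Duration of the migrate/reload phase with or without overlap."""
    return max(t_kv, t_model) if overlap else t_kv + t_model


results, log = migrate_and_reload()
saving = phase_time(3.0, 2.0, overlap=False) - phase_time(3.0, 2.0)
print(results, saving)  # ('kv_done', 'shards_done') 2.0
```

The saving equals the shorter of the two phases, so the overlap helps most when the two durations are comparable.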
Figure 5: Reconfiguration transaction state machine and critical path overlap
stateDiagram-v2
[*] --> SERVING_OLD : service running under T_old
SERVING_OLD --> QUIESCING : topology switch request
QUIESCING --> PREPARING_WORKERS : scheduler drained
PREPARING_WORKERS --> MPU_APPLIED : worker set updated
MPU_APPLIED --> MIGRATING : parallel start
MIGRATING --> KV_DONE : KV migration complete
MIGRATING --> SHARDS_DONE : shard reload complete
KV_DONE --> REBINDING : wait for both
SHARDS_DONE --> REBINDING : wait for both
REBINDING --> SERVING_NEW : scheduler resumed
SERVING_NEW --> [*]
note right of MIGRATING
T_kv and T_model run in parallel.
Critical path = max(T_kv, T_model)
end note
The three world-size cases handle the relationship between old and new total GPU counts:
Case 1, same total GPUs ($\mathrm{TP}_{old} \times \mathrm{PP}_{old} = \mathrm{TP}_{new} \times \mathrm{PP}_{new}$): No workers added or removed. Only KV migration, MPU snapshot swap, and shard reload.
Case 2, fewer total GPUs (new topology smaller): The KV state still needed under the new topology is migrated off the surplus workers first; those workers then enter standby.
Case 3, more total GPUs (new topology larger): Standby workers wake up, synchronize message-queue ring indices, and join the active set.
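The three cases reduce to a comparison of world sizes. The sketch below assumes, purely for illustration, that the active workers are always the lowest-numbered ranks; the review does not say how ReMP chooses which workers to park.

```python
def plan_worker_changes(old, new):
    """Classify a switch between (tp, pp) pairs.

    Returns (case, ranks_to_standby, ranks_to_wake), assuming active workers
    occupy ranks 0 .. world_size - 1.
    """
    old_ws = old[0] * old[1]
    new_ws = new[0] * new[1]
    if new_ws == old_ws:
        return "same", [], []
    if new_ws < old_ws:
        # Surplus ranks go to standby once their KV state has been moved away.
        return "shrink", list(range(new_ws, old_ws)), []
    return "grow", [], list(range(old_ws, new_ws))


print(plan_worker_changes((8, 1), (4, 1)))  # ('shrink', [4, 5, 6, 7], [])
print(plan_worker_changes((2, 4), (4, 2)))  # ('same', [], [])
print(plan_worker_changes((4, 1), (8, 1)))  # ('grow', [], [4, 5, 6, 7])
```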
Shared Weight Store in Detail
Conventional vLLM couples checkpoint loading with topology: each worker loads its shard conditioned on its (pp_rank, tp_rank). To switch topology, you must redo this load from disk.
ReMP’s shared weight store persists the full, unsharded model in CPU shared memory:
Startup:
MPClient → CheckPointManager → SharedStateDict (CPU shared memory)
All workers can read any shard slice via shared-memory IPC
During reconfiguration:
Worker_i reads its target shard W^(tp_rank_new, pp_rank_new)
from SharedStateDict without touching disk
The shared memory region is initialized once at service startup. Its size is the full model parameter count × dtype bytes. For Llama2-70B in BF16, this is approximately 140 GB, which fits within a modern dual-socket server’s host RAM. The read bandwidth from shared memory (CPU-to-GPU PCIe) is the bottleneck for $T_{model}$, the shard reload time, but this is still several times faster than re-reading the checkpoint from SSD.
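As a rough illustration of reading a topology-dependent shard out of one unsharded store, the sketch below uses Python's standard `multiprocessing.shared_memory` module. It is not ReMP's SharedStateDict: the one-byte "weights", the fixed 4×8 shape, and the function names are all assumptions made for the example.

```python
from multiprocessing import shared_memory

ROWS, COLS = 4, 8   # toy weight matrix of one-byte entries, stored row-major


def create_store():
    """Load the full, unsharded matrix into a named shared-memory segment once."""
    shm = shared_memory.SharedMemory(create=True, size=ROWS * COLS)
    shm.buf[:ROWS * COLS] = bytes(range(ROWS * COLS))
    return shm


def read_column_shard(store_name, tp_rank, tp):
    """Attach by name and copy out this rank's column slice of every row."""
    shm = shared_memory.SharedMemory(name=store_name)
    try:
        width = COLS // tp
        lo = tp_rank * width
        return [bytes(shm.buf[r * COLS + lo:r * COLS + lo + width])
                for r in range(ROWS)]
    finally:
        shm.close()


store = create_store()
try:
    # The same store serves different topologies: only the slice bounds change.
    shard_tp2 = read_column_shard(store.name, tp_rank=1, tp=2)   # columns 4..7
    shard_tp4 = read_column_shard(store.name, tp_rank=1, tp=4)   # columns 2..3
finally:
    store.close()
    store.unlink()
```

Because the store is unsharded, switching from TP2 to TP4 needs no new load from disk. Each worker simply recomputes its slice bounds and copies a different region.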
Key Algorithms Step by Step
Algorithm 1: ReMP Reconfiguration Transaction
Input: T_old = (TP_old, PP_old), T_new = (TP_new, PP_new),
current scheduler state S, KV metadata M, SharedStateDict W
Output: serving system running under T_new with migrated state
1. Signal scheduler to drain: set scheduler.frozen = True
2. Wait for all in-flight forward passes to complete (quiesce)
3. Compute world_size_old = TP_old × PP_old
4. Compute world_size_new = TP_new × PP_new
5. If world_size_new < world_size_old:
a. Identify workers whose KV slices are needed by T_new
b. Execute 2D KV migration from those workers first
c. Move remaining workers to STANDBY state
Elif world_size_new > world_size_old:
a. Wake standby workers, sync message-queue ring index
b. Add woken workers to active set
6. Load target MPU snapshot from MPU State Space; apply to global parallel_state
7. Concurrently:
Thread A: migrate KV cache (Algorithm 2)
Thread B: reload model shards from SharedStateDict
8. Wait for both Thread A and Thread B to finish
9. Invoke Scheduler Adapter:
a. Regenerate KV config (block size, num_blocks, layer-stage map)
b. Update block manager (expand/shrink block pools)
c. Preempt requests whose KV blocks are now invalid
d. Rebuild pipeline-parallel batch queue
10. Set scheduler.frozen = False → resume serving under T_new
Algorithm 2: Two-Dimensional KV Cache Migration
Input: T_old, T_new, set of all live KV blocks B
Output: all blocks physically located at correct GPU under T_new
For each layer ℓ = 0 to L-1:
For each KV block b in B that contains tokens for layer ℓ:
1. Compute s_old = ⌊ℓ / (L / PP_old)⌋
2. For each KV head h covered by block b:
r_old = ⌊h / (H_kv / TP_old)⌋
g_old = s_old × TP_old + r_old
s_new = ⌊ℓ / (L / PP_new)⌋
r_new = ⌊h / (H_kv / TP_new)⌋
g_new = s_new × TP_new + r_new
3. Batch all (g_old, g_new, tensor_slice) for this layer into
a P2P transfer batch
4. Execute batched P2P transfers for layer ℓ
5. Free old layer ℓ storage on source GPUs immediately
(Peak extra memory = O(one layer's KV size))
Experimental Results
Setup
- Hardware: 2 platforms — 8×NVIDIA H100 80GB and 8×NVIDIA RTX 5090 32GB
- Models: Llama2-7B, Llama2-13B, Llama2-70B
- Baselines: restart-based reconfiguration (cold restart)
- Metrics: Switching time (T_switch), TTFT (Time To First Token), TPOT (Time Per Output Token), throughput
Switching Time
On H100 with Llama2-7B, most topology transitions complete in 1–3 seconds. For Llama2-70B (which has 80 layers and 70B parameters to re-shard), switching takes up to 7 seconds — still dramatically faster than the several-minute restart baseline.
Speedup factors (ReMP vs restart) reach:
- 10–30× faster for 7B–13B models across all topology pairs
- Over 100× for several transitions on 7B (e.g., TP1PP8 → TP4PP2 on H100 with 7B completes in ~1.2 seconds vs ~170 seconds restart)
The overlap of KV migration and shard reloading (parallel rather than sequential execution of the two phases) should account for part of this speedup on 70B models, where both $T_{kv}$ and $T_{model}$ are non-trivial.
Dynamic Workload Performance
Under a simulated diurnal traffic pattern (low load 00:00–08:00, peak load 08:00–20:00, low load again 20:00–24:00), ReMP dynamically selects the optimal TP/PP configuration:
- Off-peak: switches to higher TP (e.g., TP4PP2) for lower per-request latency
- Peak: switches to higher PP (e.g., TP1PP8) for higher throughput
Compared to fixed TP1PP8 (throughput-optimized) and fixed TP2PP4 (balanced) baselines, ReMP achieves:
- Lower TTFT: particularly during off-peak hours when TP4 is used
- Lower TPOT: consistent improvement from topology-optimal configuration
- Higher output throughput: especially during peak when PP8 pipeline is activated
Figure 6: Dynamic topology selection under diurnal workload
graph LR
LowLoad[Low Load 00-08h] -->|ReMP switch| TP4PP2[TP4PP2 latency-opt]
PeakLoad[Peak Load 08-20h] -->|ReMP switch| TP1PP8[TP1PP8 throughput-opt]
OffPeak[Low Load 20-24h] -->|ReMP switch| TP4PP2b[TP4PP2 latency-opt]
Fixed[Static Baseline] -->|cannot switch| Stuck[TP1PP8 all day]
Limitations and Boundary Conditions
Memory requirement for shared weight store: The full unsharded model must fit in CPU host memory. Llama2-70B requires ~140 GB in BF16. On machines with less host RAM, this approach is infeasible. Modern serving clusters typically have 256–512 GB host RAM, but edge or resource-constrained deployments cannot use ReMP.
Topology candidates must be pre-specified: The MPU State Space pre-builds communication groups for a fixed set of candidate topologies. Adding a new topology at runtime requires rebuilding those groups (a slow operation). ReMP works best when the operator knows the 2–4 relevant configurations in advance.
vLLM v1 specific: ReMP deeply integrates with vLLM v1’s executor architecture. The techniques generalize, but porting to other engines (SGLang, TensorRT-LLM) requires similar deep integration work.
Scheduler quiescence latency: Step 1 of the transaction drains in-flight requests before reconfiguring. Under high load, some requests may queue for seconds before the drain completes, adding to effective switch latency.
KV cache migration correctness under failures: If a P2P transfer fails mid-migration, the KV cache is in an inconsistent state. The paper does not discuss partial-failure recovery (rollback to $T_{old}$); this gap could be significant in fault-prone cluster environments.
Critical Assessment: Weaknesses & Improvements
Weaknesses and Flaws
(a) Experimental scope is narrow. All experiments use Llama2-family models on 8-GPU servers. The paper does not test MoE models (Mixtral, DeepSeek), which have expert-parallel (EP) sharding that adds a third parallelism dimension. The claim that ReMP handles “mainstream LLM serving” is overstated given this gap — MoE models represent a large and growing fraction of production deployments.
(b) The diurnal workload is simulated, not from production traffic. The dynamic-workload experiments use a synthetic traffic pattern that follows a simple step function. Real production traffic is bursty, has multiple inflection points per day, and may require topology changes on sub-minute timescales. The paper does not benchmark the system under rapid, unpredictable traffic changes.
(c) No ablation on the parallel overlap savings. The paper claims the overlap of $T_{kv}$ and $T_{model}$ reduces switching time, but no figure or table isolates this effect. We do not know how much of the speedup comes from the shared weight store vs. the overlap optimization vs. the pre-built MPU snapshots.
(d) The 7-second latency for 70B is still noticeable. For latency-sensitive SLO contracts (e.g., TTFT < 500 ms), even a 7-second reconfiguration pause is problematic. The paper does not discuss how in-flight requests are handled during the quiescence period — are they delayed, dropped, or buffered?
(e) Scheduler preemption side effects are not quantified. When the Scheduler Adapter preempts requests because their KV blocks are now invalid, those requests must redo their prefill computation. The paper does not report how many requests are preempted during typical topology switches, nor the resulting TTFT inflation for affected requests.
Limitations the Authors Understate or Omit
(a) Expert parallelism is completely absent. The related work mentions “TP and PP” throughout, and so does the implementation. Expert parallelism (used by DeepSeek-V3, Mixtral, Qwen-MoE) adds a third routing-dependent sharding that ReMP’s 2D KV migration does not address.
(b) Cross-machine topology changes are not addressed. All experiments use a single 8-GPU machine with NVLink. Production deployments span multiple nodes. Changing topology across a network interconnect (InfiniBand) can be an order of magnitude slower for KV migration (PCIe or NIC bandwidth vs. NVLink), which would increase $T_{kv}$ significantly.
(c) The paper assumes a fixed total GPU budget. ReMP can switch (TP, PP) configurations but always uses the same set of physical GPUs. It cannot scale out (add new GPUs) or scale in (return GPUs to a shared pool) — a limitation for cloud-native elastic deployment.
Concrete Improvement Suggestions
(a) Extend to MoE with expert-parallel reconfiguration. The 2D migration (PP layer + TP head) could be extended to a 3D migration (PP layer + TP head + EP expert assignment). This is a natural and high-impact extension given the prevalence of MoE models.
(b) Add a predictive topology controller. ReMP reacts to workload changes. A simple time-series predictor (even a moving average) could trigger topology switches before a traffic peak, eliminating the transient degradation during the quiescence period.
(c) Characterize and minimize preemption impact. A careful ablation measuring preemption rate, affected request fraction, and TTFT inflation for preempted requests would quantify the true cost of reconfiguration on in-flight requests. This should be Table 2 in any revision.
(d) Cross-node migration with compression. For multi-node deployments, KV block migration could compress the KV state before transfer (e.g., INT8 quantization) to reduce transfer volume, trading a small accuracy loss for much faster migration over slow network interconnects.
(e) Evaluate fault recovery mid-migration. The paper should describe and evaluate a checkpoint-and-rollback scheme for the KV migration phase to make the system production-safe under partial node failures.
Reproducibility Notes
- Implementation: integrated into vLLM v1. No public code repository is linked in the paper at time of writing.
- Hardware: requires at least 8 GPUs with peer-to-peer CUDA access (NVLink or PCIe P2P).
- Memory: host RAM ≥ 1.5× full model size in BF16 (for 70B, ~210 GB).
- The paper does not specify the NCCL version or CUDA version used; these can affect P2P transfer bandwidth significantly.
Production Traffic Analysis: The Quantitative Motivation
To ground the motivation, it is worth revisiting the quantitative scale of the problem ReMP addresses. The paper analyzes two real production systems.
INFINIGENCE AI service: Traffic at 20:00 is approximately 3.8× the traffic at 04:00. The service runs a model that requires 8 GPUs. At 04:00 off-peak, optimal configuration is TP4PP2 (high TP for low-latency single requests). At 20:00 peak, optimal configuration is TP1PP8 (high PP for throughput). Without dynamic reconfiguration, the operator must choose one configuration and accept:
- TP4PP2 fixed: off-peak service is optimal, but peak throughput is ~40% below TP1PP8 optimal.
- TP1PP8 fixed: peak throughput is optimal, but off-peak TTFT is ~3.1× worse than TP4PP2.
Zhejiang Lab service: Shows a similar diurnal pattern with ~4.2× peak-to-valley ratio. The optimal configuration again flips between high-TP (off-peak) and high-PP (peak).
The key observation is that the traffic patterns are stable across days — while individual days have noise, the average over multiple days is highly predictable. This makes the reconfiguration problem tractable: you can predict when to switch based on time-of-day rather than needing instantaneous traffic prediction.
Memory Overhead Analysis: Feasibility of the Shared Weight Store
The shared weight store is the most memory-expensive component of ReMP. Here is a breakdown for common models:
| Model | Parameters | BF16 Size | FP32 Size | Recommended Host RAM |
|---|---|---|---|---|
| Llama2-7B | 7B | ~14 GB | ~28 GB | ≥ 21 GB (1.5×) |
| Llama2-13B | 13B | ~26 GB | ~52 GB | ≥ 39 GB (1.5×) |
| Llama2-70B | 70B | ~140 GB | ~280 GB | ≥ 210 GB (1.5×) |
| Llama3-405B | 405B | ~810 GB | N/A (too large) | ≥ 1.2 TB (1.5×) |
For Llama3-405B, the shared weight store alone requires over 1 TB of host RAM — a server-class requirement that many GPU clusters do not meet. This is a real scalability boundary for ReMP’s current design.
A practical mitigation is to use quantized weights in the shared store (e.g., INT8 or even INT4 for parameters that tolerate quantization) and dequantize before loading onto GPU. This would reduce the store by 2–4× at the cost of a dequantization step during shard reload.
Deep Dive: Why KV Cache Migration Is Hard
It is worth pausing to appreciate why KV cache migration is technically challenging, because this is where most of ReMP’s novelty lives.
The Consistency Problem
A KV cache block is not an opaque blob of bytes. It encodes semantic content: the key and value vectors for a specific set of tokens, computed by a specific layer, using a specific attention head’s projection. If you copy a block from GPU $i$ to GPU $j$ and GPU $j$ has a different TP rank or PP stage, the block’s head indices are wrong in the new context. A naive copy produces semantically incorrect attention: the model would attend over misaligned head slices and produce garbage outputs.
ReMP’s 2D mapping from (layer, head) to (stage, TP rank) ensures that each block arrives at the GPU that owns the correct (stage, TP rank) pair for its (layer, head) combination. This is the “semantic-aware migration” the paper refers to.
The Memory Efficiency Problem
A naive approach to KV migration would be:
- Allocate new KV memory for the new topology on all GPUs.
- Copy all blocks from old layout to new layout.
- Free old memory.
This requires holding both old and new KV layouts in GPU memory simultaneously — potentially doubling memory pressure. For a fully loaded server (KV cache consuming 40–50 GB of GPU memory), this would OOM most systems.
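ReMP solves this with layer-by-layer streaming: for each layer $\ell$, copy its blocks, then immediately free the old GPU’s storage for that layer before moving to $\ell + 1$. The peak extra memory is bounded by a single layer’s KV contribution:
$$M_{peak} = 2 \cdot H_{kv} \cdot d_h \cdot N_{ctx} \cdot \text{bytes per element}$$
where $H_{kv}$ is the number of KV heads, $d_h$ is the head dimension, and $N_{ctx}$ is the total active KV context across all live requests. For a 70B model with 8 KV heads, $d_h = 128$, 10K tokens of context, BF16:
$$M_{peak} = 2 \cdot 8 \cdot 128 \cdot 10{,}000 \cdot 2 \text{ bytes} \approx 41 \text{ MB}$$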
This is negligible compared to total GPU memory, confirming that the layer-by-layer approach is memory-safe even on a fully loaded server.
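The difference between the naive and streaming strategies can be simulated with simple bookkeeping. This is a sketch under the same assumptions as the example in the text (8 KV heads, head dimension 128, 10,000 active tokens, BF16, 80 layers); "extra" means memory held beyond one complete KV layout.

```python
def per_layer_kv_bytes(num_kv_heads, head_dim, active_tokens, bytes_per_elem=2):
    """Keys plus values for one layer across all active tokens."""
    return 2 * num_kv_heads * head_dim * active_tokens * bytes_per_elem


def peak_extra_bytes(layer_sizes, streaming=True):
    """Peak memory held on top of one full layout during migration.

    streaming=True frees each layer's old copy right after its transfer;
    streaming=False keeps every old copy until all layers are moved.
    """
    extra = peak = 0
    for size in layer_sizes:
        extra += size              # new copy of this layer now exists
        peak = max(peak, extra)
        if streaming:
            extra -= size          # old copy released immediately
    return peak


layer = per_layer_kv_bytes(8, 128, 10_000)
sizes = [layer] * 80
print(peak_extra_bytes(sizes, streaming=True))   # 40960000
print(peak_extra_bytes(sizes, streaming=False))  # 3276800000
```

With these numbers, streaming holds about 41 MB of extra memory at its peak, while the copy-everything-first approach holds about 3.3 GB.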
The Ordering Problem
KV blocks are organized as PagedAttention pages. Each page is a (block_size × d_model) tensor associated with a logical sequence at a specific layer. When TP changes, the page boundary along the head dimension changes: the set of heads that formed one rank’s complete slice under TP2 is split across two ranks under TP4, so each TP4 shard holds only half of it.
ReMP handles this at the block-manager level: the Scheduler Adapter regenerates the block table’s head-slice offsets for the new TP configuration after migration completes. This is why the Scheduler Adapter must run after the KV migration engine finishes — the block table update and the physical data migration are coordinated in the same transaction.
Performance Analysis: When Does ReMP Help Most?
The speedup from ReMP over restart-based switching comes from three sources, each dominant in different regimes:
Source 1: Eliminating Checkpoint Disk I/O
Restart requires loading the checkpoint from disk (NVMe or distributed storage). Even fast NVMe achieves ~7 GB/s sequential read. For Llama2-70B (280 GB in FP32, or 140 GB for inference weights in BF16), checkpoint loading alone takes at least 20 seconds for the BF16 weights under ideal conditions and much longer on network-attached storage.
ReMP’s shared weight store eliminates this entirely: CPU-to-GPU PCIe bandwidth (~32 GB/s for PCIe 4.0 ×16) is the new bottleneck for model reloading.
$$\frac{T_{disk}}{T_{pcie}} = \frac{M_{model} / BW_{disk}}{M_{model} / BW_{pcie}} = \frac{BW_{pcie}}{BW_{disk}}$$
where $BW_{disk} \approx 7$ GB/s and $BW_{pcie} \approx 32$ GB/s. So even ignoring other restart costs, the shard-reload phase is ~4.6 times faster.
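The bandwidth comparison can be reproduced with a back-of-the-envelope calculation. It uses the figures quoted in the text (140 GB of BF16 weights, ~7 GB/s NVMe, ~32 GB/s PCIe 4.0 ×16) and treats each path as a single link, which ignores deserialization cost and any parallelism across GPUs.

```python
def transfer_seconds(size_gb, bandwidth_gb_per_s):
    """Idealized transfer time: size divided by sustained bandwidth."""
    return size_gb / bandwidth_gb_per_s


t_disk = transfer_seconds(140, 7)    # reading the checkpoint from NVMe
t_pcie = transfer_seconds(140, 32)   # copying from host memory over PCIe
ratio = t_disk / t_pcie

print(t_disk, t_pcie, round(ratio, 2))  # 20.0 4.375 4.57
```

The ratio depends only on the two bandwidths, not on the model size, so the same factor of roughly 4.6 applies to the 7B and 13B models under these assumptions.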
Source 2: Eliminating NCCL Group Re-initialization
NCCL group creation (ncclCommInitRank in a multi-process deployment) involves a broadcast-based handshake across all participating processes. For 8 GPUs, this typically takes 3–15 seconds depending on GPU interconnect topology and NCCL version. Pre-building groups in the MPU State Space eliminates this entirely: group activation is an $O(1)$ pointer swap.
Source 3: KV Cache Preservation
Without migration, restart forces every in-flight request to redo prefill computation. For a long-context request (e.g., 32K prompt), prefill on a single H100 takes on the order of 1–5 seconds. If 50 such requests were in flight, the “recomputation debt” would be roughly 50–250 seconds of prefill work.
ReMP migrates KV blocks to their new owners, so in-flight requests resume from where they left off. The only exception is preempted requests (whose blocks were on workers that changed world size). These are presumably a small fraction of total requests in most transitions, although the paper does not report the preemption rate.
The TP vs PP Trade-off: A Quantitative View
To fully appreciate why dynamic topology switching matters, consider a concrete model of the TP/PP throughput trade-off.
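Under Tensor Parallelism with degree $t$, each column-then-row parallel block ends in one AllReduce of size $B \times S \times d_{\text{model}}$ elements, where $d_{\text{model}}$ is the hidden size; a standard transformer layer contains two such blocks (attention and MLP). At batch size $B$ and sequence length $S$, the effective throughput is approximately:
$$\text{Throughput}(t) \approx \frac{B}{T_{\text{compute}}(t) + T_{\text{comm}}(t)}$$
where $T_{\text{compute}}(t) \propto 1/t$ (linear speedup from TP), and $T_{\text{comm}}(t) \propto \frac{t-1}{t}\, B\, S\, d_{\text{model}}$ plus a per-step latency term (AllReduce cost grows with $t$). At small batch sizes (low load), $T_{\text{compute}}$ dominates, so more TP helps. At large batch sizes (high load), $T_{\text{comm}}$ becomes significant, and additional TP can reduce throughput.
Under Pipeline Parallelism with degree $p$, the pipeline bubble fraction is:
$$\text{bubble}(p, m) = \frac{p - 1}{m + p - 1}$$
where $m$ is the number of micro-batches. At high concurrency (large $m$), the bubble is negligible and PP primarily serves to partition memory. At low concurrency (small $m$), the wasted fraction grows with $p - 1$ and reaches $(p-1)/p$ at $m = 1$, making PP inefficient.
This quantitative picture explains the data in Figure 2: at low load ($B$ small, $m$ small), TP dominates. At high load ($B$ large, $m$ large), PP dominates.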
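The toy model below is one way to see the crossover on a fixed pool of 8 GPUs. Every constant in it is invented for illustration (none comes from the paper), the cost functions are deliberately crude, and the GPipe-style bubble formula $(p-1)/(m+p-1)$ is assumed.

```python
def bubble_fraction(p: int, m: int) -> float:
    """Idle fraction of a GPipe-style pipeline with p stages and m micro-batches."""
    return (p - 1) / (m + p - 1)

def step_time(t: int, b: int) -> float:
    """Toy full-model step latency under TP degree t (arbitrary units)."""
    compute = (8.0 + 1.0 * b) / t                          # shrinks as 1/t
    comm = 0.0 if t == 1 else 0.5 + 0.4 * b * (t - 1) / t  # AllReduce, grows with t
    return compute + comm

def throughput(t: int, p: int, b: int, m: int) -> float:
    """Sequences advanced per unit time for micro-batch size b, m micro-batches."""
    stage_time = step_time(t, b) / p   # each stage holds 1/p of the layers
    return (b / stage_time) * (1.0 - bubble_fraction(p, m))

CONFIGS = [(8, 1), (4, 2), (2, 4), (1, 8)]  # (t, p) with t * p = 8

def best_config(b: int, m: int) -> tuple:
    return max(CONFIGS, key=lambda cfg: throughput(cfg[0], cfg[1], b, m))

low = best_config(b=1, m=1)
high = best_config(b=64, m=64)
print("low load :", low)
print("high load:", high)
```

With these made-up constants the sketch picks TP8PP1 at low load and TP1PP8 at high load. The exact crossover point depends entirely on the constants, so only the direction of the trend should be read from it.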
The Worker Lifecycle: A Closer Look at the Three-State Machine
The Worker Lifecycle Manager is subtler than it appears. Workers cannot simply be started and stopped because CUDA context initialization — allocating GPU memory, creating NCCL communicators, loading CUDA kernels — takes on the order of 30–60 seconds for large models.
ReMP avoids this by keeping workers alive even when they are not needed:
Active worker: running forward passes, owns GPU memory and NCCL connections
↓ (world size shrinks) ↑ (world size grows)
Standby worker: process alive, GPU memory still allocated,
NCCL groups still initialized, but not receiving work
↑ "wakeup" message: sync ring index, rejoin active set
The wakeup procedure synchronizes the message-queue ring index so that standby workers can immediately receive new control messages and KV transfer requests without missing any messages that were sent while they were inactive. This ring-index sync is a very fast operation (a single shared-memory write) that completes in milliseconds.
The standby mechanism does have a cost: standby workers hold GPU memory even while inactive. If TP8PP1 is a candidate topology but the system spends most of its time in a smaller configuration such as TP2PP2 (4 active workers), the remaining workers idle with their GPU memory allocated and unusable by other workloads. This is a form of memory reservation that reduces the effective utilization of the cluster.
For example, on an 8-GPU machine running TP2PP4 (8 active workers, no standbys), switching to TP4PP2 (still 8 workers, just different topology) has no standby cost. But switching from TP4PP2 to TP2PP2 (from 8 workers to 4 workers) leaves 4 standby workers holding their GPU memory reservation for when the system might switch back.
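The active/standby transitions can be sketched as a small state machine. This is a hedged illustration: the class and method names are hypothetical, only the two states visible in the diagram above are modeled, and the ring index is reduced to a plain integer.

```python
from enum import Enum

class WorkerState(Enum):
    ACTIVE = "active"
    STANDBY = "standby"

class Worker:
    def __init__(self, rank: int):
        self.rank = rank
        self.state = WorkerState.ACTIVE
        self.ring_index = 0  # position in the control-message ring

    def to_standby(self) -> None:
        # Process, GPU memory, and communicators are kept; only scheduling stops.
        self.state = WorkerState.STANDBY

    def wakeup(self, current_ring_index: int) -> None:
        # Align with the live ring position so the next control message is read correctly.
        self.ring_index = current_ring_index
        self.state = WorkerState.ACTIVE

def apply_world_size(workers, new_world_size: int, ring_index: int) -> None:
    """Assumes the lowest ranks stay active, as in the examples above."""
    for w in workers:
        if w.rank < new_world_size and w.state is WorkerState.STANDBY:
            w.wakeup(ring_index)
        elif w.rank >= new_world_size and w.state is WorkerState.ACTIVE:
            w.to_standby()

workers = [Worker(r) for r in range(8)]
apply_world_size(workers, 4, ring_index=10)  # e.g. TP4PP2 -> TP2PP2
standby = [w.rank for w in workers if w.state is WorkerState.STANDBY]
apply_world_size(workers, 8, ring_index=25)  # back to 8 workers
print(standby, workers[7].state, workers[7].ring_index)
```

Nothing is torn down on the way to standby, which is why the way back is cheap and also why the memory stays reserved.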
Handling World Size Changes: The Three Cases in Detail
The most complex coordination happens when the number of active workers changes. Here is a detailed walkthrough of each case.
Case 1: Same world size (e.g., TP2PP4 → TP4PP2, both 8 workers)
This is the simplest case. The same 8 GPU processes participate, just organized differently. Steps:
- Freeze scheduler, quiesce in-flight passes.
- Apply target MPU snapshot (NCCL groups are already built).
- Concurrently: (A) 2D KV migration, (B) reload model shards from shared store.
- Wait for both. Rebind scheduler.
There is no worker state change at all — only KV data and model shards move.
Case 2: Fewer workers (e.g., TP4PP2 → TP4PP1, from 8 to 4 workers)
Workers 4, 5, 6, 7 will become standby. But they currently hold KV cache for layers $L/2$ through $L-1$ (of $L$ total layers) in their pipeline stage. The new TP4PP1 topology has only one pipeline stage (all 4 active workers), which must hold all $L$ layers. So layers $L/2$ through $L-1$ must be migrated from workers 4–7 to workers 0–3 before workers 4–7 enter standby.
- Compute which layers workers 4–7 own that are needed by workers 0–3.
- Execute KV migration for those layers first.
- Workers 4–7 → standby (they can release their KV memory for layers $L/2$ through $L-1$ after migration).
- Workers 0–3 apply target MPU state, reload shards.
- Rebind scheduler.
Case 3: More workers (e.g., TP4PP1 → TP4PP2, from 4 to 8 workers)
Workers 4–7 are in standby. They need to join the active pipeline as stage 1.
- Wake workers 4–7: send wakeup message, sync ring index.
- Workers 4–7 allocate GPU memory for their new pipeline stage’s KV blocks.
- Apply target MPU snapshot (including PP groups that include workers 4–7).
- KV migration: transfer KV data for layers $L/2$ through $L-1$ (the upper half of the $L$ layers) from workers 0–3 → workers 4–7.
- Reload model shards: workers 4–7 load layers $L/2$ through $L-1$ from shared store.
- Rebind scheduler.
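The three cases can be condensed into a small planner. This is a sketch that only mirrors the step lists above: the function names are hypothetical, the steps are plain strings rather than real operations, and it assumes the highest-numbered ranks are the ones that leave or join.

```python
def stage_layers(stage: int, pp: int, num_layers: int) -> range:
    """Layers owned by a pipeline stage, assuming pp divides num_layers evenly."""
    per_stage = num_layers // pp
    return range(stage * per_stage, (stage + 1) * per_stage)

def plan_switch(old: tuple, new: tuple) -> list:
    """Ordered steps for switching between two (tp, pp) topologies."""
    old_ws, new_ws = old[0] * old[1], new[0] * new[1]
    if new_ws == old_ws:  # Case 1: same world size
        return ["freeze scheduler, quiesce in-flight passes",
                "apply target MPU snapshot",
                "KV migration || shard reload (concurrent)",
                "rebind scheduler"]
    if new_ws < old_ws:   # Case 2: fewer workers
        leaving = f"{new_ws}-{old_ws - 1}"
        return [f"find layers owned by workers {leaving}",
                "migrate KV for those layers",
                f"workers {leaving} -> standby",
                "apply target MPU state, reload shards",
                "rebind scheduler"]
    joining = f"{old_ws}-{new_ws - 1}"  # Case 3: more workers
    return [f"wake workers {joining}, sync ring index",
            f"workers {joining} allocate KV memory",
            "apply target MPU snapshot",
            "migrate KV to new stage",
            "reload shards",
            "rebind scheduler"]

for old, new in [((2, 4), (4, 2)), ((4, 2), (4, 1)), ((4, 1), (4, 2))]:
    print(old, "->", new)
    for step in plan_switch(old, new):
        print("  ", step)

# Llama2-70B has 80 layers, so under TP4PP2 stage 1 owns the upper half.
moved = stage_layers(stage=1, pp=2, num_layers=80)
print(f"Case 2 moves layers {moved.start}-{moved.stop - 1}")
```

The ordering difference between Case 2 and Case 3 is the point: KV leaves a worker before it goes to standby, and a woken worker allocates memory before any KV arrives.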
Comparison with Related Work
| System | Dynamic reconfiguration | KV Cache preservation | TP adjustment | PP adjustment |
|---|---|---|---|---|
| vLLM (baseline) | ❌ | ❌ (restart) | ❌ | ❌ |
| Llumnix | Request migration (same topology) | ✅ (within topology) | ❌ | ❌ |
| SpotServe | Elasticity on preemption | Partial | ❌ | ❌ |
| PipeLive | Pipeline reconfiguration | Partial | ❌ | ✅ (PP only) |
| ReMP | ✅ TP+PP jointly | ✅ 2D migration | ✅ | ✅ |
The key differentiator is that ReMP jointly reconfigures both TP and PP dimensions with KV cache preservation. PipeLive handles PP reconfiguration but not TP. Llumnix migrates requests between instances but does not change the serving topology. ReMP is the first system to treat the full TP×PP product as a runtime-adjustable parameter.
Integration Design: What ReMP Changes in vLLM
ReMP requires changes to four core subsystems of vLLM v1. Understanding the scope of these changes helps assess portability to other engines.
Executor: The executor is vLLM’s central control plane that dispatches inference steps to workers. ReMP adds a ReconfigurationController that intercepts the executor’s main loop, quiesces the scheduler, and manages the transaction lifecycle. This requires access to vLLM’s internal executor APIs — not a public interface, making the integration version-specific.
Worker management: vLLM v1 manages worker processes through a distributed executor (multiprocessing by default on a single node, with Ray as an alternative backend). ReMP extends the worker state machine with the standby/wakeup states, requiring modifications to vLLM’s worker spawn and process group management code.
KV cache manager: vLLM’s KV cache manager (KVCacheManager in v1, the successor to the BlockSpaceManager of the older engine) tracks which KV blocks belong to which sequences. After a topology switch, block addresses change (different GPU, different offset within that GPU’s KV pool). ReMP extends the block manager to accept a “remapping” transaction that updates all block-to-sequence associations atomically.
Communication initialization: vLLM initializes NCCL groups once during startup via initialize_model_parallel(). ReMP replaces this with a multi-topology initialization that builds all candidate topology groups upfront and stores them in the MPU State Space.
For portability to SGLang or TensorRT-LLM, similar changes would be needed in each engine’s analogous subsystems. The conceptual design is general, but the implementation is tightly coupled to vLLM v1 internals.
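The “remapping” transaction described for the KV cache manager can be illustrated with a stage-then-swap pattern. This is a simplified sketch, not vLLM's block manager: `BlockLocation` and `BlockTable` are hypothetical, and a real block would be sharded across several GPUs rather than living at a single location.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockLocation:
    gpu: int
    block_id: int

class BlockTable:
    def __init__(self):
        self._table = {}  # seq_id -> list of BlockLocation

    def assign(self, seq_id: str, locations) -> None:
        self._table[seq_id] = list(locations)

    def lookup(self, seq_id: str) -> list:
        return list(self._table[seq_id])

    def remap(self, mapping: dict) -> None:
        """Apply an old -> new location mapping to every sequence, all or nothing."""
        staged = {}
        for seq_id, locs in self._table.items():
            # A missing entry raises KeyError before any visible change is made.
            staged[seq_id] = [mapping[loc] for loc in locs]
        self._table = staged  # one reference swap publishes the new table

table = BlockTable()
table.assign("req-1", [BlockLocation(4, 0), BlockLocation(4, 1)])
table.remap({
    BlockLocation(4, 0): BlockLocation(0, 7),
    BlockLocation(4, 1): BlockLocation(0, 8),
})
print(table.lookup("req-1"))

try:
    table.remap({})  # incomplete mapping: rejected, table left untouched
    remap_failed = False
except KeyError:
    remap_failed = True
```

Building the new table on the side and publishing it with one assignment means a scheduler reading the table never sees a half-updated state.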
Reproducibility Notes
- Implementation: integrated into vLLM v1. No public code repository is linked in the paper at time of writing.
- Hardware: requires at least 8 GPUs with peer-to-peer CUDA access (NVLink or PCIe P2P).
- Memory: host RAM ≥ 1.5× full model size in BF16 (for 70B, ~210 GB).
- The paper does not specify the NCCL version or CUDA version used; these can affect P2P transfer bandwidth significantly.
- Replicating the dynamic workload experiments requires a traffic simulator calibrated to the specific diurnal patterns used.
Conclusion
ReMP solves a real and under-addressed problem: in production LLM serving, the TP/PP topology has always been a fixed configuration, even though optimal topology changes with load. By decoupling model weights, KV cache, communication groups, and worker processes from the topology, and by introducing two-dimensional KV cache migration with a critical-path-overlapping transaction protocol, ReMP achieves topology switches in seconds rather than minutes. The result is an adaptive serving system that can chase the optimal parallelism configuration across the diurnal cycle.
The memory safety of layer-by-layer streaming migration, the topology snapshot swap, and the overlap of KV migration with shard reloading are all elegant engineering choices that individually appear incremental but combine to make online reconfiguration practical. The main gaps (MoE expert-parallel support, cross-node migration, elastic GPU scaling) are natural next steps that the authors can pursue from this solid foundation.
For practitioners: ReMP is most impactful on high-traffic production deployments where traffic varies by 3× or more across the day, and where the model is large enough that the optimal TP/PP setting genuinely differs between peak and off-peak. For lightly-loaded systems or small models (where a single high-TP configuration already handles all traffic), the benefit is smaller.
Summary of Key Formulas
For reference, the main mathematical relations used throughout this note:
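Notation: $L$ layers, $H$ attention heads of dimension $d_{\text{head}}$, hidden size $d_{\text{model}}$, sequence length $S$, batch size $B$, TP degree $t$, PP degree $p$, and $m$ micro-batches.
KV cache memory footprint (Eq. 1):
$$M_{\text{KV}} = 2 \times L \times H \times d_{\text{head}} \times S \times B \times \text{bytes per element}$$
TP weight sharding — Query projection split across TP degree $t$ (Eq. 3):
$$W_Q = \left[\, W_Q^{(0)} \;\middle|\; W_Q^{(1)} \;\middle|\; \cdots \;\middle|\; W_Q^{(t-1)} \,\right], \qquad W_Q^{(i)} \in \mathbb{R}^{d_{\text{model}} \times (H\, d_{\text{head}} / t)}$$
PP layer ownership for stage $s$ (Eq. 4):
$$\text{Layers}(s) = \left\{ \ell \;:\; s \cdot \tfrac{L}{p} \le \ell < (s+1) \cdot \tfrac{L}{p} \right\}$$
Source GPU for KV block (layer $\ell$, head $h$) under old topology (Eq. 5–6):
$$s_{\text{old}} = \left\lfloor \frac{\ell}{L / p_{\text{old}}} \right\rfloor, \qquad \text{GPU}_{\text{src}} = s_{\text{old}} \cdot t_{\text{old}} + \left\lfloor \frac{h}{H / t_{\text{old}}} \right\rfloor$$
Switching time with vs without parallel overlap (Eq. 9–10):
$$T_{\text{overlap}} = \max\left(T_{\text{KV}},\, T_{\text{reload}}\right), \qquad T_{\text{serial}} = T_{\text{KV}} + T_{\text{reload}}$$
Pipeline bubble fraction (Eq. 15):
$$\text{bubble}(p, m) = \frac{p - 1}{m + p - 1}$$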
These formulas collectively capture the memory-compute-communication trade-offs that motivate and enable ReMP’s design. The KV migration mapping (Eqs. 5–8, the source-and-destination GPU derivation) is the mathematical heart of the system — the rest of ReMP is engineering infrastructure built to efficiently execute that mapping at runtime, without full service reconstruction, while staying memory-safe and semantically correct at every step of the transition.
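The source-and-destination mapping can be made concrete with a short sketch. It assumes contiguous layer ranges per pipeline stage, contiguous head ranges per TP rank, and the layout rank = stage × TP degree + TP rank; these conventions are assumptions made here for illustration, and the toy sizes (8 layers, 8 KV heads) are not the paper's.

```python
def owner_rank(layer: int, head: int, tp: int, pp: int,
               num_layers: int, num_kv_heads: int) -> int:
    """GPU rank holding the KV for (layer, head) under a (tp, pp) topology."""
    stage = layer // (num_layers // pp)      # which pipeline stage owns the layer
    tp_rank = head // (num_kv_heads // tp)   # which TP shard owns the head
    return stage * tp + tp_rank

def migration_plan(old: tuple, new: tuple,
                   num_layers: int, num_kv_heads: int) -> dict:
    """Map (layer, head) -> (src_rank, dst_rank) for every block that must move."""
    plan = {}
    for layer in range(num_layers):
        for head in range(num_kv_heads):
            src = owner_rank(layer, head, old[0], old[1], num_layers, num_kv_heads)
            dst = owner_rank(layer, head, new[0], new[1], num_layers, num_kv_heads)
            if src != dst:
                plan[(layer, head)] = (src, dst)
    return plan

# TP4PP2 -> TP4PP1: the upper half of the layers moves from ranks 4-7 to 0-3.
plan = migration_plan(old=(4, 2), new=(4, 1), num_layers=8, num_kv_heads=8)
print(len(plan), plan[(4, 0)], plan[(7, 7)])
```

Blocks whose source and destination coincide are left out of the plan, which is what keeps a pure PP change from touching the lower pipeline stage at all.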
The TP/PP product ($t \times p$) sets the total GPU footprint. ReMP treats the reserved footprint as constant (standby workers keep their GPUs) but lets the distribution between $t$ and $p$ vary, a degree of freedom that prior systems did not expose. Making that degree of freedom accessible in production, with acceptable switching cost, is the paper’s lasting contribution.
As LLM serving matures and the mix of dense and MoE models diversifies, the generalization of ReMP’s approach to multi-dimensional parallelism (TP × PP × EP for MoE, TP × PP × SP for sequence parallelism) will become increasingly important. The 2D migration framework established here is the right foundation for that generalization.