ReMP: Low-Downtime Runtime Model-Parallelism Reconfiguration for LLM Serving

Review date: 2026-06-25 Review author: Zhongzhu Zhou Paper reviewed: ReMP: Low-Downtime Runtime Model-Parallelism Reconfiguration for LLM Serving Paper authors: Haipeng Yuan, Kaining Zheng, Yongshu Bai, Yuchen Zhang, Yunquan Zhang, Baodong Wu, Xiang Gao, Daning Cheng arXiv: 2606.18741 Status/Venue: arXiv preprint, June 2026 (ICT/Chinese Academy of Sciences, Zhejiang Lab, Infinigence AI)

Short Answer

Every production LLM serving system today hard-wires its Tensor Parallelism (TP) and Pipeline Parallelism (PP) configuration at launch time. Changing from TP4PP2 to TP2PP4 requires restarting the service — which takes several minutes and loses the entire KV cache. ReMP breaks this constraint. It decouples four normally static runtime states — model weights, KV cache, communication groups, and worker lifetimes — from the launch-time topology, then re-establishes each under the new configuration without tearing down the process. The result: topology switches complete in 1–7 seconds on models ranging from 7B to 70B parameters, 10–100× faster than restart, while preserving previously computed KV cache states to avoid expensive prefill recomputation.

Prerequisites: What You Need to Know First

ReMP is a systems paper that sits at the intersection of distributed model parallelism and LLM serving. To understand why it is hard and why the techniques work, you need to understand five building blocks.

1. Autoregressive Decoding and the KV Cache

Transformer-based language models generate tokens one at a time. At decode step tt, the model receives all past tokens and produces one new token. The core computation is attention over the full history. Without any optimization, computing key and value projections for all past tokens at every step costs O(T2)O(T^2) where TT is the sequence length.

The KV cache eliminates this waste: after a token’s key and value are computed in a given layer, they are stored in GPU memory and reused at every future step. This reduces per-step attention from O(T2)O(T^2) to O(T)O(T) but introduces a memory cost:

MKV=2LHkvdkvTBsizeof(dtype)(1)M_\text{KV} = 2 \cdot L \cdot H_\text{kv} \cdot d_\text{kv} \cdot T \cdot B \cdot \text{sizeof}(\text{dtype}) \tag{1}

where LL is the number of transformer layers, HkvH_\text{kv} is the number of KV heads per layer, dkvd_\text{kv} is the head dimension, TT is the context length, and BB is the batch size. For Llama2-70B (80 layers, 8 KV heads, dkv=128d_\text{kv} = 128) with a 4K context, batch 32, BF16:

MKV=2×80×8×128×4096×32×210.7 GB(2)M_\text{KV} = 2 \times 80 \times 8 \times 128 \times 4096 \times 32 \times 2 \approx 10.7 \text{ GB} \tag{2}

The KV cache layout is inseparable from the model parallelism topology — it determines which GPU stores which layers and which attention heads. This coupling is the root cause of ReMP’s problem.

2. Tensor Parallelism (TP)

Tensor parallelism (Megatron-LM, 2021) splits individual weight matrices across multiple GPUs along a dimension so that each GPU computes a slice of every layer’s output.

For the attention projection WQRd×HdkW_Q \in \mathbb{R}^{d \times H \cdot d_k}, TP splits across the head dimension:

WQ=[WQ(0)WQ(1)WQ(r1)](3)W_Q = [W_Q^{(0)} \mid W_Q^{(1)} \mid \cdots \mid W_Q^{(r-1)}] \tag{3}

GPU ii holds slice WQ(i)Rd×(H/r)dkW_Q^{(i)} \in \mathbb{R}^{d \times (H/r) \cdot d_k} and computes Q(i)=XWQ(i)Q^{(i)} = X W_Q^{(i)}, where rr is the TP degree. After each attention sub-computation, an AllReduce across the rr GPUs collects the full output.

The key trade-off:

  • Higher TP degree → smaller per-GPU memory, faster single-request latency (more GPUs share computation), but more frequent AllReduce communication overhead.
  • Lower TP degree → less communication overhead, higher single-GPU utilization at scale.

In KV cache terms, TP rank ii on a given pipeline stage owns KV heads [iHkv/r,(i+1)Hkv/r)[i \cdot H_\text{kv}/r, (i+1) \cdot H_\text{kv}/r) for every layer on that stage.

3. Pipeline Parallelism (PP)

Pipeline parallelism (GPipe, PipeDream) partitions the transformer’s LL layers across pp pipeline stages. Stage ss owns layers:

layers(s)=[sL/p,(s+1)L/p)(4)\text{layers}(s) = [s \cdot \lfloor L/p \rfloor, \,(s+1) \cdot \lfloor L/p \rfloor) \tag{4}

During inference, a request flows through stages 0 → 1 → … → p1p-1. Multiple micro-batches can overlap in the pipeline to improve GPU utilization (the “pipeline bubble” trade-off).

The KV cache for a given request is partitioned across stages: stage ss stores the KV blocks for its own layers. When TP changes the number of active workers per stage, or PP changes layer ownership between stages, KV cache blocks stored on old GPUs must be physically moved to the correct new GPUs.

4. vLLM and PagedAttention

vLLM (Kwon et al., 2023) is the dominant open-source LLM serving engine. Its key contribution, PagedAttention, manages the KV cache as a pool of fixed-size blocks (like OS virtual memory pages), eliminating fragmentation and enabling dynamic block allocation. Each block holds the KV state for a contiguous chunk of tokens.

ReMP is implemented on top of vLLM v1. The integration touches vLLM’s executor, worker management, KV cache manager, and communication initialization pipeline.

5. The Static Topology Problem

When vLLM (or any serving engine) launches with TP2PP4, the following are all initialized with respect to that specific configuration:

  • Weight tensors: each GPU loads its fixed shard.
  • Communication groups: NCCL groups are created for TP AllReduce and PP peer-to-peer.
  • KV cache blocks: allocated and indexed with a specific layer-to-stage and head-to-TP-rank mapping.
  • Worker processes: exactly TP × PP = 8 workers, each bound to one (pp_rank, tp_rank).

There is no mechanism to change any of these without shutting down and restarting from scratch. In production, restart takes 2–10 minutes (checkpoint reload, CUDA initialization, NCCL group setup, memory allocation). All accumulated KV cache is lost, forcing every in-flight request to redo its prefill computation.

Figure 1: Static topology coupling vs. ReMP’s decoupled design

graph TD
    subgraph Static["Static vLLM (Current)"]
        WT["Weight Shards<br/>(bound to topology)"] --> T["TP/PP Topology<br/>(fixed at launch)"]
        KV["KV Cache Layout<br/>(bound to topology)"] --> T
        CG["NCCL Comm Groups<br/>(bound to topology)"] --> T
        WK["Worker Processes<br/>(bound to topology)"] --> T
    end
    subgraph ReMP["ReMP (Decoupled)"]
        SWS["Shared Weight Store<br/>(CPU shared memory)"] --> RC["Reconfiguration Controller"]
        MSS["MPU State Space<br/>(multi-topology snapshots)"] --> RC
        KME["KV Migration Engine<br/>(2D remap)"] --> RC
        WLM["Worker Lifecycle Mgr<br/>(standby/wakeup)"] --> RC
        SA["Scheduler Adapter<br/>(rebinds after switch)"] --> RC
    end
    T -. "restart required (minutes)" .-> T
    RC -. "online switch (seconds)" .-> RC

Problem Statement: Dynamic Workloads vs Static Topology

ReMP’s motivation is rooted in a real production observation. Real-world LLM serving traffic is not constant — it follows diurnal patterns. The paper shows traffic data from two production systems (INFINIGENCE AI and Zhejiang Lab) where traffic at 20:00 is roughly 3–5× higher than at 04:00.

The critical insight: the optimal TP/PP configuration is traffic-dependent.

  • Low traffic (few concurrent requests): Latency is the bottleneck. Minimize per-request processing time by using high TP degree — more GPUs share each request’s computation, reducing TTFT (Time To First Token).
  • High traffic (many concurrent requests): Throughput is the bottleneck. Use high PP degree — pipeline stages process different requests in parallel, maximizing tokens/second, while TP’s AllReduce overhead per step becomes the limiting factor.

A fixed TP4PP2 configuration, optimized for low traffic, degrades throughput at peak. A fixed TP1PP8 configuration, optimized for peak, has unacceptable latency during off-peak. Because restart costs minutes, production systems are forced to pick one configuration and live with suboptimal performance for part of the day.

Figure 2: TP vs. PP trade-off under different traffic regimes

graph LR
    subgraph LowLoad["Low Traffic"]
        direction TB
        L1["High TP degree<br/>(e.g., TP4PP2)"]
        L2["More GPUs per request<br/>→ faster single-request latency"]
        L3["TTFT ↓, TPOT ↓"]
        L1 --> L2 --> L3
    end
    subgraph HighLoad["High Traffic"]
        direction TB
        H1["High PP degree<br/>(e.g., TP1PP8)"]
        H2["More pipeline stages<br/>→ higher concurrency"]
        H3["Throughput ↑, fewer AllReduce"]
        H1 --> H2 --> H3
    end
    LowLoad -. "traffic spike" .-> HighLoad
    HighLoad -. "traffic drops" .-> LowLoad
    style LowLoad fill:#e8f4e8
    style HighLoad fill:#fde8e8

Core Contribution

ReMP makes TP/PP topology a runtime-adjustable resource rather than a launch-time parameter. It achieves this through three coupled innovations:

  1. State decoupling: Break the hard binding between topology and the four runtime states (weights, KV cache, communication groups, workers).
  2. Two-dimensional KV cache migration: Semantically remap and physically transfer KV blocks along both the PP (layer) and TP (KV head) dimensions simultaneously.
  3. End-to-end integration in vLLM v1: A complete implementation covering executor, workers, cache manager, and comm initialization.

System Design

Architecture: Six Components

ReMP augments vLLM’s serving engine with six new modules:

Figure 3: ReMP architecture and component interactions

graph TB
    RC["Reconfiguration Controller<br/>(entry point, transaction coord)"]
    SWS["Shared Weight Store<br/>(full model in CPU shared mem)"]
    MSS["MPU State Space<br/>(pre-built topology snapshots)"]
    WLM["Worker Lifecycle Manager<br/>(active / standby / wakeup)"]
    KME["KV Migration Engine<br/>(2D remap + P2P transfer)"]
    SA["Scheduler Adapter<br/>(block mgr rebind + preempt)"]

    RC --> SWS
    RC --> MSS
    RC --> WLM
    RC --> KME
    RC --> SA

    SWS -. "CPU SharedStateDict<br/>reloads target GPU shards" .-> WLM
    MSS -. "apply target topology<br/>NCCL groups + rank maps" .-> WLM
    KME -. "layer-by-layer P2P<br/>batched transfers" .-> WLM
    SA -. "rebuild block table<br/>preempt if needed" .-> RC

Reconfiguration Controller: The transaction entry point. Receives the target (TP, PP) pair, orchestrates all other components, and ensures that model shards, KV cache, communication state, and scheduler metadata are mutually consistent before resuming serving.

Shared Weight Store: On startup, the full model checkpoint is loaded into CPU shared memory as a SharedStateDict. When a topology switch occurs, each worker reads its target shard directly from this shared store rather than re-loading checkpoint files from disk. This eliminates the dominant latency of restart: disk I/O and checkpoint deserialization.

MPU (Model Parallel Unit) State Space: For each candidate topology (e.g., TP1PP8, TP2PP4, TP4PP2, TP8PP1), a parallel-state snapshot is pre-built at startup. Each snapshot contains:

  • TP group membership and inter-GPU process groups
  • PP peer ranks (prev/next stage)
  • Rank-to-device mapping
  • Per-topology NCCL communicator handles

Switching means selecting the correct snapshot and applying it to vLLM’s global parallel_state — an O(1)O(1) operation rather than a full NCCL group rebuild.

Worker Lifecycle Manager: Manages three worker states — active, standby, and waking_up. When the new topology needs fewer GPUs (e.g., TP8PP1 → TP4PP1), extra workers enter standby after their KV state is migrated away. When more GPUs are needed, standby workers wake up and synchronize the message-queue ring index so they can receive control messages immediately.

KV Migration Engine: The most technically complex component. It computes the 2D remapping plan and executes batched P2P GPU-to-GPU transfers.

Scheduler Adapter: After a topology switch, the block manager’s KV block table must reflect the new memory layout. The adapter:

  1. Regenerates the KV cache configuration (block size, num_blocks per GPU, layer-to-stage mapping).
  2. Expands or shrinks per-GPU block pools.
  3. Preempts running requests whose KV blocks now span workers that no longer exist.
  4. Refreshes the pipeline-parallel batch queue.

Two-Dimensional KV Cache Migration

The KV cache layout is governed by two independent dimensions that both change when topology changes:

  • Pipeline Parallelism (layer dimension): Stage ss in a pp-stage pipeline owns layers [sL/p,(s+1)L/p)[s \cdot L/p, (s+1) \cdot L/p). When pp changes, every layer potentially moves to a new stage on a new GPU.
  • Tensor Parallelism (head dimension): TP rank ii in a rr-rank tensor-parallel group owns KV heads [iHkv/r,(i+1)Hkv/r)[i \cdot H_\text{kv}/r, (i+1) \cdot H_\text{kv}/r) of every layer on its stage. When rr changes, the head slices are redistributed.

The migration plan for a single KV block for layer \ell, KV head hh, under topology change (TPold,PPold)(TPnew,PPnew)(TP_\text{old}, PP_\text{old}) \to (TP_\text{new}, PP_\text{new}):

  1. Source stage in old topology: sold=/(L/PPold)s_\text{old} = \lfloor \ell / (L / PP_\text{old}) \rfloor
  2. Source TP rank in old topology: rold=h/(Hkv/TPold)r_\text{old} = \lfloor h / (H_\text{kv} / TP_\text{old}) \rfloor
  3. Source GPU (global rank): gold=soldTPold+roldg_\text{old} = s_\text{old} \cdot TP_\text{old} + r_\text{old}
  4. Destination stage in new topology: snew=/(L/PPnew)s_\text{new} = \lfloor \ell / (L / PP_\text{new}) \rfloor
  5. Destination TP rank in new topology: rnew=h/(Hkv/TPnew)r_\text{new} = \lfloor h / (H_\text{kv} / TP_\text{new}) \rfloor
  6. Destination GPU: gnew=snewTPnew+rnewg_\text{new} = s_\text{new} \cdot TP_\text{new} + r_\text{new}

The migration is executed layer by layer: after all KV blocks for layer \ell are transferred via batched P2P operations, the old GPU’s storage for that layer is freed immediately — avoiding the need to hold both old and new layouts in memory simultaneously. This streaming approach keeps peak memory overhead at O(one layer’s KV size)O(\text{one layer's KV size}) rather than O(full model KV size)O(\text{full model KV size}).

Figure 4: Two-dimensional KV cache migration under TP2PP2 → TP1PP4 transition

graph LR
    subgraph OLD [Old Topology TP2PP2]
        G0o[GPU0 Stage0 TP0]
        G1o[GPU1 Stage0 TP1]
        G2o[GPU2 Stage1 TP0]
        G3o[GPU3 Stage1 TP1]
    end
    subgraph NEW [New Topology TP1PP4]
        G0n[GPU0 Stage0]
        G1n[GPU1 Stage1]
        G2n[GPU2 Stage2]
        G3n[GPU3 Stage3]
    end
    G0o -->|P2P| G0n
    G1o -->|P2P| G0n
    G0o -->|P2P| G1n
    G1o -->|P2P| G1n
    G2o -->|P2P| G2n
    G3o -->|P2P| G2n
    G2o -->|P2P| G3n
    G3o -->|P2P| G3n

Reconfiguration Transaction

Each topology switch is wrapped in a controlled reconfiguration transaction. The transaction is a linearizable sequence of states:

SERVING (T_old)
  → QUIESCING (drain in-flight requests, freeze scheduler)
  → PREPARING_WORKERS (activate/standby/wakeup workers)
  → APPLYING_MPU (load target topology snapshot)
  → [MIGRATE_KV ‖ RELOAD_SHARDS] (parallel)
  → REBINDING (scheduler adapter, block manager)
  → SERVING (T_new)

The critical path optimization is overlapping model-shard reloading with KV cache migration. These two operations touch disjoint memory regions (model parameters vs. KV cache blocks), so they can proceed in parallel:

Tswitchseq=Tworker+Tmpu+Tkv+Tmodel+Tsched(5)T_\text{switch}^\text{seq} = T_\text{worker} + T_\text{mpu} + T_\text{kv} + T_\text{model} + T_\text{sched} \tag{5} Tswitchoverlap=Tworker+Tmpu+max(Tkv,Tmodel)+Tsched(6)T_\text{switch}^\text{overlap} = T_\text{worker} + T_\text{mpu} + \max(T_\text{kv}, T_\text{model}) + T_\text{sched} \tag{6}

The saving is min(Tkv,Tmodel)\min(T_\text{kv}, T_\text{model}), which is significant for large models (where TmodelT_\text{model} dominates) and for cache-heavy workloads (where TkvT_\text{kv} dominates).

Figure 5: Reconfiguration transaction state machine and critical path overlap

stateDiagram-v2
    [*] --> SERVING_OLD : service running under T_old
    SERVING_OLD --> QUIESCING : topology switch request
    QUIESCING --> PREPARING_WORKERS : scheduler drained
    PREPARING_WORKERS --> MPU_APPLIED : worker set updated
    MPU_APPLIED --> MIGRATING : parallel start
    MIGRATING --> KV_DONE : KV migration complete
    MIGRATING --> SHARDS_DONE : shard reload complete
    KV_DONE --> REBINDING : wait for both
    SHARDS_DONE --> REBINDING : wait for both
    REBINDING --> SERVING_NEW : scheduler resumed
    SERVING_NEW --> [*]
    note right of MIGRATING
        T_kv and T_model run in parallel.
        Critical path = max(T_kv, T_model)
    end note

The three world-size cases handle the relationship between old and new total GPU counts:

Case 1 — Same total GPUs (TPold×PPold=TPnew×PPnewTP_\text{old} \times PP_\text{old} = TP_\text{new} \times PP_\text{new}): No workers added or removed. Only KV migration, MPU snapshot swap, and shard reload.

Case 2 — Fewer total GPUs (new topology smaller): Extra workers with needed KV state are migrated first; those workers then enter standby.

Case 3 — More total GPUs (new topology larger): Standby workers wake up, synchronize message-queue ring indices, and join the active set.

Shared Weight Store in Detail

Conventional vLLM couples checkpoint loading with topology: each worker loads its shard conditioned on its (pp_rank, tp_rank). To switch topology, you must redo this load from disk.

ReMP’s shared weight store persists the full, unsharded model in CPU shared memory:

Startup:
  MPClient → CheckPointManager → SharedStateDict (CPU shared memory)
  All workers can read any shard slice via shared-memory IPC

During reconfiguration:
  Worker_i reads its target shard W^(tp_rank_new, pp_rank_new)
  from SharedStateDict without touching disk

The shared memory region is initialized once at service startup. Its size is the full model parameter count × dtype bytes — for Llama2-70B in BF16, this is approximately 140 GB, which fits within a modern dual-socket server’s host RAM. The read bandwidth from shared memory (CPU-to-GPU PCIe) is the bottleneck for TmodelT_\text{model}, but this is still orders of magnitude faster than re-reading from SSD.

Key Algorithms Step by Step

Algorithm 1: ReMP Reconfiguration Transaction

Input: T_old = (TP_old, PP_old), T_new = (TP_new, PP_new),
       current scheduler state S, KV metadata M, SharedStateDict W
Output: serving system running under T_new with migrated state

1. Signal scheduler to drain: set scheduler.frozen = True
2. Wait for all in-flight forward passes to complete (quiesce)
3. Compute world_size_old = TP_old × PP_old
4. Compute world_size_new = TP_new × PP_new
5. If world_size_new < world_size_old:
     a. Identify workers whose KV slices are needed by T_new
     b. Execute 2D KV migration from those workers first
     c. Move remaining workers to STANDBY state
   Elif world_size_new > world_size_old:
     a. Wake standby workers, sync message-queue ring index
     b. Add woken workers to active set
6. Load target MPU snapshot from MPU State Space; apply to global parallel_state
7. Concurrently:
     Thread A: migrate KV cache (Algorithm 2)
     Thread B: reload model shards from SharedStateDict
8. Wait for both Thread A and Thread B to finish
9. Invoke Scheduler Adapter:
     a. Regenerate KV config (block size, num_blocks, layer-stage map)
     b. Update block manager (expand/shrink block pools)
     c. Preempt requests whose KV blocks are now invalid
     d. Rebuild pipeline-parallel batch queue
10. Set scheduler.frozen = False → resume serving under T_new

Algorithm 2: Two-Dimensional KV Cache Migration

Input: T_old, T_new, set of all live KV blocks B
Output: all blocks physically located at correct GPU under T_new

For each layer ℓ = 0 to L-1:
  For each KV block b in B that contains tokens for layer ℓ:
    1. Compute s_old = ⌊ℓ / (L / PP_old)⌋
    2. For each KV head h covered by block b:
         r_old = ⌊h / (H_kv / TP_old)⌋
         g_old = s_old × TP_old + r_old

         s_new = ⌊ℓ / (L / PP_new)⌋
         r_new = ⌊h / (H_kv / TP_new)⌋
         g_new = s_new × TP_new + r_new

    3. Batch all (g_old, g_new, tensor_slice) for this layer into
       a P2P transfer batch
  4. Execute batched P2P transfers for layer ℓ
  5. Free old layer ℓ storage on source GPUs immediately
  (Peak extra memory = O(one layer's KV size))

Experimental Results

Setup

  • Hardware: 2 platforms — 8×NVIDIA H100 80GB and 8×NVIDIA RTX 5090 32GB
  • Models: Llama2-7B, Llama2-13B, Llama2-70B
  • Baselines: restart-based reconfiguration (cold restart)
  • Metrics: Switching time (T_switch), TTFT (Time To First Token), TPOT (Time Per Output Token), throughput

Switching Time

On H100 with Llama2-7B, most topology transitions complete in 1–3 seconds. For Llama2-70B (which has 80 layers and 70B parameters to re-shard), switching takes up to 7 seconds — still dramatically faster than the several-minute restart baseline.

Speedup factors (ReMP vs restart) reach:

  • 10–30× faster for 7B–13B models across all topology pairs
  • Over 100× for several transitions on 7B (e.g., TP1PP8 → TP4PP2 on H100 with 7B completes in ~1.2 seconds vs ~170 seconds restart)

The overlap of KV migration and shard reloading (Eq. 6 vs Eq. 5) accounts for a measurable fraction of this speedup on 70B models where both TkvT_\text{kv} and TmodelT_\text{model} are non-trivial.

Dynamic Workload Performance

Under a simulated diurnal traffic pattern (low load 00:00–08:00, peak load 08:00–20:00), ReMP dynamically selects the optimal TP/PP configuration:

  • Off-peak: switches to higher TP (e.g., TP4PP2) for lower per-request latency
  • Peak: switches to higher PP (e.g., TP1PP8) for higher throughput

Compared to fixed TP1PP8 (throughput-optimized) and fixed TP2PP4 (balanced) baselines, ReMP achieves:

  • Lower TTFT: particularly during off-peak hours when TP4 is used
  • Lower TPOT: consistent improvement from topology-optimal configuration
  • Higher output throughput: especially during peak when PP8 pipeline is activated

Figure 6: Dynamic topology selection under diurnal workload

graph LR
    LowLoad[Low Load 00-08h] -->|ReMP switch| TP4PP2[TP4PP2 latency-opt]
    PeakLoad[Peak Load 08-20h] -->|ReMP switch| TP1PP8[TP1PP8 throughput-opt]
    OffPeak[Low Load 20-24h] -->|ReMP switch| TP4PP2b[TP4PP2 latency-opt]
    Fixed[Static Baseline] -->|cannot switch| Stuck[TP1PP8 all day]

Limitations and Boundary Conditions

Memory requirement for shared weight store: The full unsharded model must fit in CPU host memory. Llama2-70B requires ~140 GB in BF16. On machines with less host RAM, this approach is infeasible. Modern serving clusters typically have 256–512 GB host RAM, but edge or resource-constrained deployments cannot use ReMP.

Topology candidates must be pre-specified: The MPU State Space pre-builds communication groups for a fixed set of candidate topologies. Adding a new topology at runtime requires rebuilding those groups (a slow operation). ReMP works best when the operator knows the 2–4 relevant configurations in advance.

vLLM v1 specific: ReMP deeply integrates with vLLM v1’s executor architecture. The techniques generalize, but porting to other engines (SGLang, TensorRT-LLM) requires similar deep integration work.

Scheduler quiescence latency: Step 1 of the transaction drains in-flight requests before reconfiguring. Under high load, some requests may queue for seconds before the drain completes, adding to effective switch latency.

KV cache migration correctness under failures: If a P2P transfer fails mid-migration, the KV cache is in an inconsistent state. The paper does not discuss partial-failure recovery (rollback to ToldT_\text{old}) — this gap could be significant in fault-prone cluster environments.

Critical Assessment: Weaknesses & Improvements

Weaknesses and Flaws

(a) Experimental scope is narrow. All experiments use Llama2-family models on 8-GPU servers. The paper does not test MoE models (Mixtral, DeepSeek), which have expert-parallel (EP) sharding that adds a third parallelism dimension. The claim that ReMP handles “mainstream LLM serving” is overstated given this gap — MoE models represent a large and growing fraction of production deployments.

(b) The diurnal workload is simulated, not from production traffic. The dynamic-workload experiments use a synthetic traffic pattern that follows a simple step function. Real production traffic is bursty, has multiple inflection points per day, and may require topology changes on sub-minute timescales. The paper does not benchmark the system under rapid, unpredictable traffic changes.

(c) No ablation on the parallel overlap savings. The paper claims the overlap of TkvT_\text{kv} and TmodelT_\text{model} reduces switching time, but no figure or table isolates this effect. We do not know how much of the speedup comes from the shared weight store vs. the overlap optimization vs. the pre-built MPU snapshots.

(d) The 7-second latency for 70B is still noticeable. For latency-sensitive SLO contracts (e.g., TTFT < 500 ms), even a 7-second reconfiguration pause is problematic. The paper does not discuss how in-flight requests are handled during the quiescence period — are they delayed, dropped, or buffered?

(e) Scheduler preemption side effects are not quantified. When the Scheduler Adapter preempts requests because their KV blocks are now invalid, those requests must redo their prefill computation. The paper does not report how many requests are preempted during typical topology switches, nor the resulting TTFT inflation for affected requests.

Limitations the Authors Understate or Omit

(a) Expert parallelism is completely absent. The related work mentions “TP and PP” throughout, and so does the implementation. Expert parallelism (used by DeepSeek-V3, Mixtral, Qwen-MoE) adds a third routing-dependent sharding that ReMP’s 2D KV migration does not address.

(b) Cross-machine topology changes are not addressed. All experiments use a single 8-GPU machine with NVLink. Production deployments span multiple nodes. Changing topology across a network interconnect (InfiniBand) is orders of magnitude slower for KV migration (PCIe or NIC bandwidth vs. NVLink), which would increase TkvT_\text{kv} significantly.

(c) The paper assumes a fixed total GPU budget. ReMP can switch (TP, PP) configurations but always uses the same set of physical GPUs. It cannot scale out (add new GPUs) or scale in (return GPUs to a shared pool) — a limitation for cloud-native elastic deployment.

Concrete Improvement Suggestions

(a) Extend to MoE with expert-parallel reconfiguration. The 2D migration (PP layer + TP head) could be extended to a 3D migration (PP layer + TP head + EP expert assignment). This is a natural and high-impact extension given the prevalence of MoE models.

(b) Add a predictive topology controller. ReMP reacts to workload changes. A simple time-series predictor (even a moving average) could trigger topology switches before a traffic peak, eliminating the transient degradation during the quiescence period.

(c) Characterize and minimize preemption impact. A careful ablation measuring preemption rate, affected request fraction, and TTFT inflation for preempted requests would quantify the true cost of reconfiguration on in-flight requests. This should be Table 2 in any revision.

(d) Cross-node migration with compression. For multi-node deployments, KV block migration could use weight-space compression (e.g., INT8) to reduce transfer volume, trading a small accuracy loss for much faster migration over slow network interconnects.

(e) Evaluate fault recovery mid-migration. The paper should describe and evaluate a checkpoint-and-rollback scheme for the KV migration phase to make the system production-safe under partial node failures.

Reproducibility Notes

  • Implementation: integrated into vLLM v1. No public code repository is linked in the paper at time of writing.
  • Hardware: requires at least 8 GPUs with peer-to-peer CUDA access (NVLink or PCIe P2P).
  • Memory: host RAM ≥ 1.5× full model size in BF16 (for 70B, ~210 GB).
  • The paper does not specify the NCCL version or CUDA version used; these can affect P2P transfer bandwidth significantly.

Production Traffic Analysis: The Quantitative Motivation

Before diving into the system design, it is worth understanding the quantitative scale of the problem ReMP addresses. The paper analyzes two real production systems.

INFINIGENCE AI service: Traffic at 20:00 is approximately 3.8× the traffic at 04:00. The service runs a model that requires 8 GPUs. At 04:00 off-peak, optimal configuration is TP4PP2 (high TP for low-latency single requests). At 20:00 peak, optimal configuration is TP1PP8 (high PP for throughput). Without dynamic reconfiguration, the operator must choose one configuration and accept:

  • TP4PP2 fixed: off-peak service is optimal, but peak throughput is ~40% below TP1PP8 optimal.
  • TP1PP8 fixed: peak throughput is optimal, but off-peak TTFT is ~3.1× worse than TP4PP2.

Zhejiang Lab service: Shows a similar diurnal pattern with ~4.2× peak-to-valley ratio. The optimal configuration again flips between high-TP (off-peak) and high-PP (peak).

The key observation is that the traffic patterns are stable across days — while individual days have noise, the average over multiple days is highly predictable. This makes the reconfiguration problem tractable: you can predict when to switch based on time-of-day rather than needing instantaneous traffic prediction.

Memory Overhead Analysis: Feasibility of the Shared Weight Store

The shared weight store is the most memory-expensive component of ReMP. Here is a breakdown for common models:

ModelParametersBF16 SizeFP32 SizeRecommended Host RAM
Llama2-7B7B~14 GB~28 GB≥ 21 GB (1.5×)
Llama2-13B13B~26 GB~52 GB≥ 39 GB (1.5×)
Llama2-70B70B~140 GB~280 GB≥ 210 GB (1.5×)
Llama3-405B405B~810 GBN/A (too large)≥ 1.2 TB (1.5×)

For Llama3-405B, the shared weight store alone requires over 1 TB of host RAM — a server-class requirement that many GPU clusters do not meet. This is a real scalability boundary for ReMP’s current design.

A practical mitigation is to use quantized weights in the shared store (e.g., INT8 or even INT4 for parameters that tolerate quantization) and dequantize before loading onto GPU. This would reduce the store by 2–4× at the cost of a dequantization step during shard reload.

Deep Dive: Why KV Cache Migration Is Hard

It is worth pausing to appreciate why KV cache migration is technically challenging, because this is where most of ReMP’s novelty lives.

The Consistency Problem

A KV cache block is not an opaque blob of bytes. It encodes semantic content — the key and value vectors for a specific set of tokens, computed by a specific layer, using a specific attention head’s projection. If you copy a block from GPU AA to GPU BB and GPU BB has a different TP rank or PP stage, the block’s head indices are wrong in the new context. A naive copy produces semantically incorrect attention — the model would attend over misaligned head slices and produce garbage outputs.

ReMP’s 2D mapping (Eqs. 5–8) ensures that each block arrives at the GPU that owns the correct (stage, TP rank) pair for its (layer, head) combination. This is the “semantic-aware migration” the paper refers to.

The Memory Efficiency Problem

A naive approach to KV migration would be:

  1. Allocate new KV memory for the new topology on all GPUs.
  2. Copy all blocks from old layout to new layout.
  3. Free old memory.

This requires holding both old and new KV layouts in GPU memory simultaneously — potentially doubling memory pressure. For a fully loaded server (KV cache consuming 40–50 GB of GPU memory), this would OOM most systems.

ReMP solves this with layer-by-layer streaming: for each layer \ell, copy its blocks, then immediately free the old GPU’s storage for that layer before moving to +1\ell+1. The peak extra memory is bounded by a single layer’s KV contribution:

ΔMpeak=2HkvdkvTactivesizeof(dtype)(11)\Delta M_\text{peak} = 2 \cdot H_\text{kv} \cdot d_\text{kv} \cdot T_\text{active} \cdot \text{sizeof}(\text{dtype}) \tag{11}

where TactiveT_\text{active} is the total active KV context across all live requests. For a 70B model with 8 KV heads, dkv=128d_\text{kv}=128, 10K tokens of context, BF16:

ΔMpeak=2×8×128×10000×240.96 MB per layer(12)\Delta M_\text{peak} = 2 \times 8 \times 128 \times 10000 \times 2 \approx 40.96 \text{ MB per layer} \tag{12}

This is negligible compared to total GPU memory, confirming that the layer-by-layer approach is memory-safe even on a fully loaded server.

The Ordering Problem

KV blocks are organized as PagedAttention pages. Each page is a (block_size × d_model) tensor associated with a logical sequence at a specific layer. When TP changes, the page boundary along the head dimension changes: what was a complete head slice under TP2 becomes only half a head’s worth of data in a TP4 shard.

ReMP handles this at the block-manager level: the Scheduler Adapter regenerates the block table’s head-slice offsets for the new TP configuration after migration completes. This is why the Scheduler Adapter must run after the KV migration engine finishes — the block table update and the physical data migration are coordinated in the same transaction.

Performance Analysis: When Does ReMP Help Most?

The speedup from ReMP over restart-based switching comes from three sources, each dominant in different regimes:

Source 1: Eliminating Checkpoint Disk I/O

Restart requires loading the checkpoint from disk (NVMe or distributed storage). Even fast NVMe achieves ~7 GB/s sequential read. For Llama2-70B (280 GB checkpoint with optimizer states, or 140 GB for inference weights in BF16), checkpoint loading alone takes 20–30 seconds under ideal conditions and much longer on network-attached storage.

ReMP’s shared weight store eliminates this entirely — CPU-to-GPU PCIe bandwidth (~32 GB/s for PCIe 4.0 ×16) is the new bottleneck for model reloading.

TdiskWBdisk,TsharedWshardBPCIe(13)T_\text{disk} \approx \frac{|W|}{B_\text{disk}}, \quad T_\text{shared} \approx \frac{|W_\text{shard}|}{B_\text{PCIe}} \tag{13}

where Wshard=W/(TP×PP)|W_\text{shard}| = |W| / (TP \times PP) and BPCIe4×BdiskB_\text{PCIe} \approx 4 \times B_\text{disk}. So even ignoring other restart costs, the shard-reload phase is ~4×TP×PP4 \times TP \times PP times faster.

Source 2: Eliminating NCCL Group Re-initialization

NCCL group creation (ncclCommInitAll) involves a broadcast-based handshake across all participating processes. For 8 GPUs, this typically takes 3–15 seconds depending on GPU interconnect topology and NCCL version. Pre-building groups in the MPU State Space eliminates this entirely — group activation is an O(1)O(1) pointer swap.

Source 3: KV Cache Preservation

Without migration, restart forces every in-flight request to redo prefill computation. For a long-context request (e.g., 32K prompt), prefill on a single H100 takes on the order of 1–5 seconds. If 50 such requests were in-flight, the “recomputation debt” is enormous.

ReMP migrates KV blocks in-place, so in-flight requests resume from where they left off. The only exception is preempted requests (whose blocks were on workers that changed world size) — but these are a small fraction of total requests in most transitions.

The TP vs PP Trade-off: A Quantitative View

To fully appreciate why dynamic topology switching matters, consider a concrete model of the TP/PP throughput trade-off.

Under Tensor Parallelism with degree rr, the communication overhead per transformer layer is one AllReduce of size 2d2d (column-then-row parallel). At batch size BB and sequence length TT, the effective throughput is approximately:

ThroughputTPBTTcompute(r)+Tallreduce(r)(14)\text{Throughput}_\text{TP} \approx \frac{B \cdot T}{T_\text{compute}(r) + T_\text{allreduce}(r)} \tag{14}

where Tcompute(r)1/rT_\text{compute}(r) \propto 1/r (linear speedup from TP), and Tallreduce(r)d(r1)/r1/BNVLinkT_\text{allreduce}(r) \propto d \cdot (r-1)/r \cdot 1/B_\text{NVLink} (AllReduce cost grows with rr). At small batch sizes (low load), TcomputeT_\text{compute} dominates, so more TP helps. At large batch sizes (high load), TallreduceT_\text{allreduce} becomes significant, making more TP harmful.

Under Pipeline Parallelism with degree pp, the pipeline bubble fraction is:

bubble fraction=p1p+m/p1(15)\text{bubble fraction} = \frac{p-1}{p + \lceil m/p \rceil - 1} \tag{15}

where mm is the number of micro-batches. At high concurrency (large mm), the bubble is negligible and PP primarily serves to partition memory. At low concurrency (small mm), the bubble waste is proportional to 1/m1/m, making PP inefficient.

This quantitative picture explains the data in Figure 2: at low load (mm small, BB small), TP dominates. At high load (mm large, BB large), PP dominates.

The Worker Lifecycle: A Closer Look at the Three-State Machine

The Worker Lifecycle Manager is subtler than it appears. Workers cannot simply be started and stopped because CUDA context initialization — allocating GPU memory, creating NCCL communicators, loading CUDA kernels — takes on the order of 30–60 seconds for large models.

ReMP avoids this by keeping workers alive even when they are not needed:

Active worker: running forward passes, owns GPU memory and NCCL connections
     ↓ (world size shrinks)                    ↑ (world size grows)
Standby worker: process alive, GPU memory still allocated,
                NCCL groups still initialized, but not receiving work
     ↑ "wakeup" message: sync ring index, rejoin active set

The wakeup procedure synchronizes the message-queue ring index so that standby workers can immediately receive new control messages and KV transfer requests without missing any messages that were sent while they were inactive. This ring-index sync is a very fast operation (a single shared-memory write) that completes in milliseconds.

The standby mechanism does have a cost: standby workers hold GPU memory even while inactive. If TP8PP1 is a candidate topology but the system spends most of its time in TP1PP8 (fewer active workers), the standby workers idle with their GPU memory allocated and unusable by other workloads. This is a form of memory reservation that reduces the effective utilization of the cluster.

For example, on an 8-GPU machine running TP2PP4 (8 active workers, no standbys), switching to TP4PP2 (still 8 workers, just different topology) has no standby cost. But switching from TP4PP2 to TP2PP2 (from 8 workers to 4 workers) leaves 4 standby workers holding their GPU memory reservation for when the system might switch back.

Handling World Size Changes: The Three Cases in Detail

The most complex coordination happens when the number of active workers changes. Here is a detailed walkthrough of each case.

Case 1: Same world size (e.g., TP2PP4 → TP4PP2, both 8 workers)

This is the simplest case. The same 8 GPU processes participate, just organized differently. Steps:

  1. Freeze scheduler, quiesce in-flight passes.
  2. Apply target MPU snapshot (NCCL groups are already built).
  3. Concurrently: (A) 2D KV migration, (B) reload model shards from shared store.
  4. Wait for both. Rebind scheduler.

There is no worker state change at all — only KV data and model shards move.

Case 2: Fewer workers (e.g., TP4PP2 → TP4PP1, from 8 to 4 workers)

Workers 4, 5, 6, 7 will become standby. But they currently hold KV cache for layers L/2L/2 through L1L-1 in their pipeline stage. The new TP4PP1 topology has only one pipeline stage (all 4 active workers), which must hold all layers. So layers L/2L/2 through L1L-1 must be migrated from workers 4–7 to workers 0–3 before workers 4–7 enter standby.

  1. Compute which layers workers 4–7 own that are needed by workers 0–3.
  2. Execute KV migration for those layers first.
  3. Workers 4–7 → standby (they can release their layer L/2L/2 through L1L-1 KV memory after migration).
  4. Workers 0–3 apply target MPU state, reload shards.
  5. Rebind scheduler.

Case 3: More workers (e.g., TP4PP1 → TP4PP2, from 4 to 8 workers)

Workers 4–7 are in standby. They need to join the active pipeline as stage 1.

  1. Wake workers 4–7: send wakeup message, sync ring index.
  2. Workers 4–7 allocate GPU memory for their new pipeline stage’s KV blocks.
  3. Apply target MPU snapshot (including PP groups that include workers 4–7).
  4. KV migration: transfer layers L/2L/2 through L1L-1 KV data from workers 0–3 → workers 4–7.
  5. Reload model shards: workers 4–7 load layers L/2L/2 through L1L-1 from shared store.
  6. Rebind scheduler.
SystemDynamic reconfigurationKV Cache preservationTP adjustmentPP adjustment
vLLM (baseline)❌ (restart)
LlumnixRequest migration (same topology)✅ (within topology)
SpotServeElasticity on preemptionPartial
PipeLivePipeline reconfigurationPartial✅ (PP only)
ReMP✅ TP+PP jointly✅ 2D migration

The key differentiator is that ReMP jointly reconfigures both TP and PP dimensions with KV cache preservation. PipeLive handles PP reconfiguration but not TP. Llumnix migrates requests between instances but does not change the serving topology. ReMP is the first system to treat the full TP×PP product as a runtime-adjustable parameter.

Integration Design: What ReMP Changes in vLLM

ReMP requires changes to four core subsystems of vLLM v1. Understanding the scope of these changes helps assess portability to other engines.

Executor: The executor is vLLM’s central control plane that dispatches inference steps to workers. ReMP adds a ReconfigurationController that intercepts the executor’s main loop, quiesces the scheduler, and manages the transaction lifecycle. This requires access to vLLM’s internal executor APIs — not a public interface, making the integration version-specific.

Worker management: vLLM v1 manages worker processes via a Ray-based distributed executor. ReMP extends the worker state machine with the standby/wakeup states, requiring modifications to vLLM’s worker spawn and process group management code.

KV cache manager: vLLM’s BlockSpaceManager tracks which KV blocks belong to which sequences. After a topology switch, block addresses change (different GPU, different offset within that GPU’s KV pool). ReMP extends the block manager to accept a “remapping” transaction that updates all block-to-sequence associations atomically.

Communication initialization: vLLM initializes NCCL groups once during startup via initialize_model_parallel(). ReMP replaces this with a multi-topology initialization that builds all candidate topology groups upfront and stores them in the MPU State Space.

For portability to SGLang or TensorRT-LLM, similar changes would be needed in each engine’s analogous subsystems. The conceptual design is general, but the implementation is tightly coupled to vLLM v1 internals.

Reproducibility Notes

  • Implementation: integrated into vLLM v1. No public code repository is linked in the paper at time of writing.
  • Hardware: requires at least 8 GPUs with peer-to-peer CUDA access (NVLink or PCIe P2P).
  • Memory: host RAM ≥ 1.5× full model size in BF16 (for 70B, ~210 GB).
  • The paper does not specify the NCCL version or CUDA version used; these can affect P2P transfer bandwidth significantly.
  • Replicating the dynamic workload experiments requires a traffic simulator calibrated to the specific diurnal patterns used.

Conclusion

ReMP solves a real and under-addressed problem: in production LLM serving, the TP/PP topology has always been a fixed configuration, even though optimal topology changes with load. By decoupling model weights, KV cache, communication groups, and worker processes from the topology, and by introducing two-dimensional KV cache migration with a critical-path-overlapping transaction protocol, ReMP achieves topology switches in seconds rather than minutes. The result is an adaptive serving system that can chase the optimal parallelism configuration across the diurnal cycle.

The memory-safety of layer-by-layer streaming migration, the O(1)O(1) topology snapshot swap, and the overlap of TkvT_\text{kv} with TmodelT_\text{model} are all elegant engineering choices that individually appear incremental but combine to make online reconfiguration practical. The main gaps — MoE expert-parallel support, cross-node migration, elastic GPU scaling — are natural next steps that the authors can pursue from this solid foundation.

For practitioners: ReMP is most impactful on high-traffic production deployments where traffic varies by 3× or more across the day, and where the model is large enough that the optimal TP/PP setting genuinely differs between peak and off-peak. For lightly-loaded systems or small models (where a single high-TP configuration already handles all traffic), the benefit is smaller.

Summary of Key Formulas

For reference, the main mathematical relations used throughout this note:

KV cache memory footprint (Eq. 1):

MKV=2LHkvdkvTBsizeof(dtype)M_\text{KV} = 2 \cdot L \cdot H_\text{kv} \cdot d_\text{kv} \cdot T \cdot B \cdot \text{sizeof}(\text{dtype})

TP weight sharding — Query projection split across TP degree rr (Eq. 3):

WQ=[WQ(0)WQ(1)WQ(r1)],WQ(i)Rd×(H/r)dkW_Q = [W_Q^{(0)} \mid W_Q^{(1)} \mid \cdots \mid W_Q^{(r-1)}], \quad W_Q^{(i)} \in \mathbb{R}^{d \times (H/r) \cdot d_k}

PP layer ownership for stage ss (Eq. 4):

layers(s)=[sL/p,  (s+1)L/p)\text{layers}(s) = \left[s \cdot \lfloor L/p \rfloor,\; (s+1) \cdot \lfloor L/p \rfloor\right)

Source GPU for KV block (layer \ell, head hh) under old topology (Eq. 5–6):

sold=L/PPold,rold=hHkv/TPold,gold=soldTPold+rolds_\text{old} = \left\lfloor \frac{\ell}{\lfloor L/PP_\text{old} \rfloor} \right\rfloor, \quad r_\text{old} = \left\lfloor \frac{h}{\lfloor H_\text{kv}/TP_\text{old} \rfloor} \right\rfloor, \quad g_\text{old} = s_\text{old} \cdot TP_\text{old} + r_\text{old}

Switching time with vs without parallel overlap (Eq. 9–10):

Tswitchseq=Tworker+Tmpu+Tkv+Tmodel+TschedT_\text{switch}^\text{seq} = T_\text{worker} + T_\text{mpu} + T_\text{kv} + T_\text{model} + T_\text{sched} Tswitchoverlap=Tworker+Tmpu+max(Tkv,Tmodel)+TschedT_\text{switch}^\text{overlap} = T_\text{worker} + T_\text{mpu} + \max(T_\text{kv}, T_\text{model}) + T_\text{sched}

Pipeline bubble fraction (Eq. 15):

bubble fraction=p1p+m/p1\text{bubble fraction} = \frac{p-1}{p + \lceil m/p \rceil - 1}

These formulas collectively capture the memory-compute-communication trade-offs that motivate and enable ReMP’s design. The KV migration mapping (Eqs. 5–8, the source-and-destination GPU derivation) is the mathematical heart of the system — the rest of ReMP is engineering infrastructure built to efficiently execute that mapping at runtime, without full service reconstruction, while staying memory-safe and semantically correct at every step of the transition.

The TP/PP product (r×pr \times p) sets the total GPU footprint. ReMP treats this footprint as constant but lets the distribution between rr and pp vary — a degree of freedom that prior systems did not expose. Making that degree of freedom accessible in production, with acceptable switching cost, is the paper’s lasting contribution.

As LLM serving matures and the mix of dense and MoE models diversifies, the generalization of ReMP’s approach to multi-dimensional parallelism (TP × PP × EP for MoE, TP × PP × SP for sequence parallelism) will become increasingly important. The 2D migration framework established here is the right foundation for that generalization.