June 4, 2026 EN #LLM Inference #Scheduling #Model Serving

Llumnix: Dynamic Scheduling for Large Language Model Serving

Review date: 2026-06-04 Review author: Zhongzhu Zhou Paper reviewed: Llumnix: Dynamic Scheduling for Large Language Model Serving Paper authors: Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Wei Lin arXiv: 2406.03243v1, 2024-06-05 Venue/status: OSDI ‘24, Alibaba Group / Alibaba Cloud PAI-EAS

Short Answer

Llumnix is a multi-instance LLM serving system that introduces runtime request rescheduling — the ability to migrate in-flight requests, along with their KV cache state, from one GPU instance to another while the request is actively being decoded. Inspired by OS process scheduling across CPU cores, Llumnix treats each LLM request like a process: it can be moved between instances at any point during its lifetime, enabling continuous load balancing, memory de-fragmentation, SLO-differentiated priority handling, and rapid auto-scaling.

The key technical contributions are: (1) a near-zero-downtime live migration mechanism based on a pre-copy strategy that hides transfer cost behind ongoing computation; (2) a virtual usage abstraction that unifies all rescheduling scenarios under a single load-balancing policy; and (3) a distributed scheduler architecture (LlumSched + Llumlet + Cluster Meta Store) that sustains continuous rescheduling at scale. Evaluations on a 16-GPU cluster show P99 TTFT improvements up to 15×, P99 per-token decode latency improvements up to 2×, 1.5× acceleration for high-priority requests, and 36% cost savings at matched tail latency.

I find the OS analogy deeply apt, and the virtual usage unification is elegant. The key weakness is that the evaluation uses only vLLM as the backend and synthetic workload traces — the system’s behavior under real production traffic with prefix-cached or long-context requests is not characterized.

1. Prerequisites

1.1 The LLM Inference Request Lifecycle

Before understanding Llumnix’s scheduling innovations, we need to be precise about what happens when a user sends a prompt to an LLM serving system.

Step 1 — Prefill: The model processes the entire input token sequence in one parallel forward pass, producing the first output token. This phase is compute-bound. For an input of length $n_{in}$ tokens, the prefill requires processing $n_{in}$ tokens through all $L$ transformer layers simultaneously. Latency scales roughly as $O(n_{in})$ (actually sub-linear with FlashAttention, but let us use linear for intuition). The latency of this phase determines the Time-to-First-Token (TTFT) — how long the user waits before seeing any response.

Step 2 — Decode: After the prefill, the model generates output tokens one at a time, auto-regressively. At decode step $t$ , the model takes the $t$ -th previously generated token plus all cached keys/values (the KV cache) and produces token $t+1$ . This phase is memory-bandwidth-bound. Each decode step takes roughly the same time, and the per-step latency is called the time-between-tokens (TBT) or inter-token latency (ITL). The total output sequence length $n_{out}$ is unknown at the start — it depends on when the model generates an <EOS> token.

Key implication: A single request holds GPU memory for its entire lifetime, from prefill to the final decode step. Its memory footprint grows dynamically as $n_{out}$ increases. This is fundamentally different from a traditional DNN inference where execution time and memory are both known before the request starts.

1.2 KV Cache: The Root of Memory Dynamics

The transformer attention mechanism computes:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_h}}\right) V

where $Q, K, V$ are the query, key, and value projections of the current token. To avoid recomputing all previous keys and values at each decode step, they are stored in memory — this is the KV cache.

The memory required to store the KV cache for a single sequence of length $n$ tokens across all layers is:

M_{\text{KV}}(n) = 2 \times n \times L \times H \times d_h \times \text{bytes\_per\_element}

where:

$2$ accounts for both keys and values
$L$ is the number of transformer layers
$H$ is the number of attention heads
$d_h$ is the per-head dimension
bytes_per_element = 2 for float16 / bfloat16

Concrete example: LLaMA-2-13B has $L = 40$ layers, $H = 40$ heads, $d_h = 128$ , dtype = fp16 (2 bytes). For a sequence of length $n = 4096$ tokens:

M_{\text{KV}}(4096) = 2 \times 4096 \times 40 \times 40 \times 128 \times 2 = 3.3 \text{ GB}

This is a substantial fraction of GPU memory (the model weights themselves are ~26 GB). And since $n = n_{in} + n_{out}$ , and $n_{out}$ grows during generation, this memory demand is dynamic and unpredictable — you cannot know before generation starts how much KV cache memory the request will ultimately require.

1.3 PagedAttention and Continuous Batching

Continuous batching (introduced by ORCA): Instead of waiting for a fixed batch of requests to complete before starting a new batch, a request can join the running batch immediately as soon as another request finishes. This decouples batch boundaries from request boundaries, dramatically improving GPU utilization.

PagedAttention (introduced by vLLM): Instead of allocating a single contiguous memory region for a request’s KV cache (wasteful because max-length reservation is needed), vLLM partitions memory into fixed-size pages (blocks) and allocates them on-demand. The KV cache for a request grows by allocating new pages as more tokens are generated.

Physical memory layout (PagedAttention):
+-------+-------+-------+-------+-------+-------+
| Block0| Block1| Block2| Block3| Block4| Block5|
|  Req A|  Req A|  Req B|  Req A|  Free |  Req C|
+-------+-------+-------+-------+-------+-------+
         (non-contiguous blocks are linked via page table)

PagedAttention eliminates internal fragmentation (no padding within a request’s KV cache). However, as we will see in §3, it does not eliminate external fragmentation — the free blocks can be spread across instances in ways that prevent large new requests from being scheduled.

1.4 SLO Metrics for LLM Serving

A Service Level Objective (SLO) for LLM serving typically specifies:

TTFT SLO: user should receive the first token within $T_1$ seconds (e.g., 2s)
TBT SLO: subsequent tokens should arrive within $T_2$ ms each (e.g., 50ms)
P99 target: these latency bounds should hold for at least 99% of requests

P99 tail latency is much harder to control than median latency: a small fraction of requests being preempted or queued for long periods can violate the SLO even if the median is excellent.

1.5 Multi-Instance LLM Serving

In production, a single LLM service is typically backed by many GPU instances (replicas), each running a copy of the model. A front-end scheduler dispatches incoming requests to instances. The common dispatching policies are:

Round-robin: simple, but ignores instance load
Least-loaded (by queue length or memory usage): better, but load is computed from stale information
Random: simple but can cause load imbalance

All of these policies dispatch a request once and never move it again. This is Llumnix’s key departure point: what if we could move requests between instances at runtime?

2. The Problem: Why Static Dispatch Fails

Llumnix begins with three concrete failure modes of static dispatch systems. Each is characterized quantitatively in the paper.

2.1 Problem 1 — Unpredictable Preemptions

Even with a carefully tuned initial dispatch, a single instance can run out of memory mid-generation. When this happens, the engine must preempt — evict a running request’s KV cache (swap to CPU RAM or simply discard and recompute), stop it, and restart it later.

Preemption is catastrophically expensive:

If the KV cache is swapped to CPU RAM, the subsequent restart requires reading it back (~PCIe bandwidth bottleneck, often slower than GPU-GPU transfers)
If recomputation is used (vLLM v0 default), the entire prefill of the preempted request must be rerun — wasting GPU compute proportional to the sequence length

Figure 1 — Preemption cascades: Llumnix’s motivation experiment on LLaMA-7B with a moderate 62% average memory load shows that 8% of requests get preempted at least once. More alarming: the P99 per-token decode latency is 3.8× worse than the P50, and 70% of the P99 request’s total latency is attributable purely to preemption loss. One P99 request experienced 50 seconds of total preemption penalty — just from being preempted twice.

Decode latency percentiles (LLaMA-7B, 62% avg memory load):
                P50       P75       P90       P99
Decode time:    12ms      18ms      25ms      46ms
Preemption:      0ms       0ms       6ms      32ms  (70% of P99!)
─────────────────────────────────────────────────
Total:          12ms      18ms      31ms      46ms

The core issue: once a request is dispatched to an instance, there is no mechanism to move it away if that instance gets overloaded. Preemption is the only recourse.

2.2 Problem 2 — Performance Interference in Batches

Multiple requests sharing an instance compete for GPU compute and HBM bandwidth. A longer or heavier sequence slows down all other sequences in the same batch.

Figure 2 — Decode step latency vs. batch occupancy:

Decode step latency for LLaMA-7B (single A100):
Tokens in batch | Per-step latency | Slowdown
---------------|-----------------|----------
64             | 8ms             | 1×
256            | 12ms            | 1.5×
512            | 17ms            | 2.1×
1024           | 21ms            | 2.6×

The 2.6× gap shows that a request colocated with many other requests is significantly slowed down simply by the presence of those other requests. A high-priority interactive request landing on a heavily loaded instance will experience degraded per-token latency even if it was never preempted.

The implication: even if preemptions are avoided, we still want to give high-priority requests more isolation — i.e., colocate them with fewer other requests. This requires being able to move other requests away from the high-priority request’s instance.

2.3 Problem 3 — Memory Fragmentation Conflict

There is a fundamental tension in static dispatch:

To reduce preemptions and interference: spread requests across instances (load balance). Each instance has fewer requests, lower memory pressure.
But spreading requests fragments the free memory across instances. Each instance holds some resident requests but has a moderate amount of free memory.

Consider 4 instances, each with 24 GB GPU memory, running LLaMA-7B (model: 14 GB, KV reserve: 10 GB each). After initial dispatch with load balancing:

Instance 0: [model 14GB] [KV: 6GB used] [Free: 4GB]
Instance 1: [model 14GB] [KV: 5GB used] [Free: 5GB]
Instance 2: [model 14GB] [KV: 4GB used] [Free: 6GB]
Instance 3: [model 14GB] [KV: 3GB used] [Free: 7GB]

Now a new request arrives with a 2000-token input that needs 1.6 GB of KV cache for its prefill. Every instance has at least 4 GB free — but which one do we send it to? Any choice is fine.

But now a 4000-token input arrives needing 3.3 GB. Only Instance 3 can handle it (7 GB free). If Instance 3 also gets the next long-input request, we have a problem: both get queued on Instance 3 while Instances 0–2 sit underloaded.

This is external fragmentation at the cluster level: the total free memory is ample, but it’s scattered such that large new requests cannot be absorbed by the most appropriate instance.

Figure 3 — Fragmentation experiment: The paper shows that with 4 LLaMA-7B instances, a workload with long-input requests (lengths following a Zipf distribution with max=4096 tokens) suffers from 3–5× worse P99 TTFT when load balancing is used (fragmentation dominates) versus a greedy “pack” strategy — but the greedy strategy suffers 4× more preemptions. There is no static policy that resolves both.

Llumnix’s answer: once migration is possible, we can dynamically move requests to defragment free memory at the cluster level on-the-fly.

2.4 Problem 4 — No Priority Differentiation

Current LLM inference systems treat all requests for a model equally. Commercial services often have tiered SLOs (e.g., ChatGPT Plus vs. free tier). There is no mechanism in vLLM/FasterTransformer/TGI to guarantee faster responses for high-priority requests without physically provisioning separate GPU instances for each tier — expensive and wasteful.

3. Llumnix Design: Runtime Rescheduling

3.1 The OS Analogy

The core conceptual move in Llumnix is to draw an explicit analogy between LLM request scheduling and OS process scheduling:

OS process scheduling          Llumnix request scheduling
─────────────────────          ──────────────────────────
Process                  ←→   LLM request
Working set (RAM pages)  ←→   KV cache (GPU memory blocks)
CPU core                 ←→   GPU model instance
Context switch           ←→   KV cache migration
MLFQ / CFS scheduler     ←→   Virtual usage scheduler

Just as an OS scheduler can context switch a process from one CPU core to another (moving registers, TLB state, pages), Llumnix can migrate a request from one GPU instance to another (moving the KV cache pages, restarting decode on the destination).

This analogy is not just rhetorical. It guides the design:

Migration must be fast enough that it doesn’t degrade service (just as context switches must be fast enough not to dominate CPU time)
The scheduler must observe real-time state of all instances (just as the OS scheduler tracks process states and CPU loads)
The scheduling policy must be unified across multiple goals (just as the CFS scheduler handles both fairness and throughput via one priority number)

3.2 What Rescheduling Scenarios Does Llumnix Support?

graph TD
    A[New Request Arrives] --> B[LlumSched: Initial Dispatch]
    B --> C{Best Instance?}
    C -->|Low load| D[Instance A: Running]
    C -->|Available memory| E[Instance B: Running]

    D --> F{Runtime Monitor}
    E --> F

    F -->|Load imbalance| G[Migrate from heavy → light instance]
    F -->|Fragmentation| H[Consolidate to free up contiguous space]
    F -->|High-priority req| I[Evict low-priority from isolated instance]
    F -->|Scale-out| J[Migrate to new instance to saturate it]
    F -->|Scale-in| K[Drain instance before shutdown]

    G --> L[LlumSched: Continuous Rescheduling]
    H --> L
    I --> L
    J --> L
    K --> L

Figure 4 — Rescheduling scenario taxonomy. The five scenarios are:

Load balancing: migrate from overloaded to underloaded instances → reduces preemptions and interference
De-fragmentation: consolidate running requests onto fewer instances, freeing large contiguous blocks for new long-input requests
Prioritization: migrate low-priority requests off an instance, giving a high-priority request more memory and GPU cycles
Scale-out: migrate existing requests to newly provisioned instances so they can start serving immediately
Scale-in: drain all requests from an instance before turning it off (cost saving)

4. Live Migration: The Core Mechanism

4.1 The Challenge: Downtime Must Be Near-Zero

Migrating a request requires transferring its KV cache from the source GPU to the destination GPU. The naive approach:

Naive migration:
Time → [stop_request][transfer_all_KV_cache(n tokens)][restart_on_dest]
          (O(1))           (O(n))                        (O(1))
Total downtime = T_stop + T_transfer(n) + T_restart
               ≈ T_transfer(n) ≈ O(n)

For a request with $n = 4096$ tokens and KV cache size of 3.3 GB (LLaMA-13B), the transfer over a NVLink (900 GB/s) takes:

T_{\text{transfer}} = \frac{3.3 \text{ GB}}{900 \text{ GB/s}} = 3.7 \text{ ms}

Over PCIe 4.0 (32 GB/s, cross-node), it takes:

T_{\text{transfer}} = \frac{3.3 \text{ GB}}{32 \text{ GB/s}} = 103 \text{ ms}

For TBT SLOs of 50ms, even the NVLink transfer is marginal, and the cross-node transfer is a clear SLO violation. More importantly, downtime grows with sequence length $n$ — long-running requests (which are more likely to be migrated due to high memory usage) would suffer the most.

4.2 Pre-Copy Migration: Near-Zero Downtime

Llumnix’s solution is a pre-copy migration approach, analogous to VM live migration in hypervisors:

Step 1 — Start pre-copy. While the source instance continues running decode steps, begin copying the KV cache blocks to the destination in the background. The source instance keeps generating new tokens; the KV cache keeps growing.

Step 2 — Continue generating. The source runs $k$ more decode steps during the pre-copy. Each decode step adds $\Delta_{\text{block}}$ new KV cache blocks. The destination receives the blocks transferred so far but is not yet active.

Step 3 — Stop-and-transfer delta. At a chosen “migration point” (e.g., after the pre-copy has finished transferring the bulk), stop the source instance’s decode for this request and transfer only the newly generated blocks (the “delta” from the last decode steps). Because only a small delta needs to be transferred in the stop phase, this time is constant — independent of sequence length.

Step 4 — Restart on destination. The destination has all KV cache blocks now and can resume decode. The total request downtime is just $T_\delta$ :

T_{\text{Llumnix}} = T_{\text{stop}} + T_{\text{transfer}}(\Delta_{\text{blocks}}) + T_{\text{restart}} = O(1) \quad \text{(constant w.r.t. } n\text{)}

Figure 5 — Migration timeline comparison:

Naive migration:
│←─ Request running ─────────────────────────────────────────────────→│
│                   STOP│←── Transfer n blocks ──→│ RESTART on dest   │
│←── downtime (O(n)) ──────────────────────────────────────────────────→│

Pre-copy migration:
│←── Request running ────────────────────────────────────────────→│
│       Pre-copy in BG: │← transfer bulk ─→│                      │
│                                           STOP│←Δ→│ RESTART      │
│←── downtime (O(1), small) ─────────────────────────────────────→│
                         └── background copy (overlapped) ──┘

Pseudocode for Pre-Copy Migration:

Algorithm 1: Llumnix Pre-Copy Request Migration
Input: request r, source instance S, destination instance D
Output: r continues execution on D with near-zero downtime

1:  // Phase 1: Pre-copy (background, overlapped with decode)
2:  pre_copy_done ← False
3:  launch background thread:
4:      for each KV cache block b in r.kv_cache:
5:          transfer_block(b, from=S, to=D)
6:      pre_copy_done ← True
7:  
8:  // Phase 2: Continue running decode on S while pre-copy proceeds
9:  while not pre_copy_done:
10:     S.run_decode_step(r)
11:     r.new_blocks.append(S.get_latest_block(r))
12: 
13: // Phase 3: Stop-and-transfer delta
14: S.suspend_request(r)                      // stop decode at S (O(1))
15: delta_blocks ← r.new_blocks               // small, O(k decode steps)
16: for each block b in delta_blocks:
17:     transfer_block(b, from=S, to=D)       // transfer only the delta
18: 
19: // Phase 4: Restart on destination
20: D.resume_request(r, kv_cache=r.kv_cache)  // all blocks now at D
21: return  // r runs on D; S frees r's KV blocks

Why this works: The total downtime for the request is only the time to transfer the delta blocks plus the stop/restart overhead. Since $|\text{delta\_blocks}| = k$ (the number of decode steps during pre-copy), and $k$ is small (a few steps), the downtime is bounded by $k \cdot \text{block\_size} / \text{bandwidth}$ — a constant independent of $n$ .

Key design detail: Llumnix coordinates the migration at the iteration (decode step) boundary. The source stops the request only after completing a full decode step, not in the middle of one. This ensures the KV cache state is consistent at the migration point.

4.3 Bandwidth Management

Concurrent migrations risk saturating the GPU-to-GPU interconnect, hurting ongoing decode latency. Llumnix rate-limits migration bandwidth:

Intra-node (NVLink): bounded at a configurable fraction of NVLink bandwidth
Inter-node (RDMA/PCIe): bounded at a configurable fraction of network bandwidth

This prevents the migration mechanism from degrading the service it was designed to improve.

5. Virtual Usage: Unified Scheduling

5.1 The Insight: One Number to Rule Them All

Llumnix’s scheduling policy needs to handle five different rescheduling scenarios (§3.2). Instead of implementing five separate policies, Llumnix introduces the concept of virtual usage — a single number per request that can be tuned to express any rescheduling goal:

u_{\text{virtual}}(r) = u_{\text{real}}(r) + \delta(r, \text{scenario})

where $u_{\text{real}}(r)$ is the actual KV cache memory consumed by request $r$ , and $\delta(r, \text{scenario})$ is a scenario-specific bonus or penalty.

The scheduling rule is simple: always migrate from the instance with the highest total virtual usage to the instance with the lowest. The scheduler does not need to know why it is migrating — the $\delta$ adjustments embed the policy.

5.2 Virtual Usage Rules Per Scenario

┌────────────────────────────────────────────────────────────────────┐
│ Scenario           │ δ(r, scenario) rule                           │
├────────────────────┼───────────────────────────────────────────────┤
│ Load balancing     │ δ = 0: use real usage. Migrate from heavy→    │
│                    │ light. Directly reduces preemptions.          │
├────────────────────┼───────────────────────────────────────────────┤
│ De-fragmentation   │ δ = +ε for pending large requests (inflate    │
│                    │ their "virtual usage" on the target instance, │
│                    │ so running requests migrate away, freeing     │
│                    │ contiguous space).                            │
├────────────────────┼───────────────────────────────────────────────┤
│ Prioritization     │ δ = +large for low-priority requests: inflated│
│                    │ virtual usage makes them appear expensive on  │
│                    │ their current instance → scheduler migrates   │
│                    │ them away from the high-priority request.     │
├────────────────────┼───────────────────────────────────────────────┤
│ Scale-out          │ δ = -large for all requests on old instances: │
│                    │ new instance appears "cheapest" → requests    │
│                    │ migrate toward it.                            │
├────────────────────┼───────────────────────────────────────────────┤
│ Scale-in           │ δ = +∞ for all requests on draining instance: │
│                    │ they appear maximally expensive → all migrate │
│                    │ away → instance becomes empty → shut down.    │
└────────────────────┴───────────────────────────────────────────────┘

Figure 6 — Virtual usage policy table. The elegance is that one scheduler with one objective (balance virtual usage) handles all five scenarios by changing only the $\delta$ rules.

5.3 Scheduling Algorithm Pseudocode

Algorithm 2: Llumnix Virtual Usage Scheduler
Input: cluster state (per-instance real usage, pending queues, request priorities)
Output: migration decisions

// Run continuously (triggered by load change events from Llumlet)

1:  procedure RESCHEDULE(event):
2:      update_virtual_usages(cluster_state)   // apply δ rules per scenario
3:      src ← argmax_i( V_total[i] )           // most loaded instance (virtual)
4:      dst ← argmin_i( V_total[i] )           // least loaded instance (virtual)
5:      if V_total[src] - V_total[dst] > MIGRATION_THRESHOLD:
6:          r ← select_candidate(src)          // pick request to migrate
7:          // Prefer: lowest priority, largest KV cache, non-preempted
8:          enqueue_migration(r, src, dst)
9:  
10: procedure SELECT_CANDIDATE(src_instance):
11:     candidates ← src_instance.running_requests
12:     // Score = low priority × large KV size (maximize virtual usage reduction)
13:     return argmax_{r in candidates}( priority_inv(r) × kv_size(r) )
14:
15: procedure UPDATE_VIRTUAL_USAGES(state):
16:     for each request r in cluster:
17:         u_virt[r] ← u_real[r]                            // base
18:         if r.priority == LOW:
19:             u_virt[r] += PRIORITY_BONUS                  // push low-pri away
20:         if r.instance in draining_instances:
21:             u_virt[r] += DRAIN_BONUS                     // drain scenario
22:         if r.instance in scaling_out_instances:
23:             u_virt[r] -= SCALE_OUT_DISCOUNT              // attract to new inst
24:     for each pending long request p:
25:         target ← find_best_instance(p.needed_space)
26:         u_virt_pending[target] += DEFRAG_BONUS           // reserve space
27:     for each instance i:
28:         V_total[i] ← sum( u_virt[r] for r on instance i ) + u_virt_pending[i]

What makes this work: The scheduler does not need to enumerate all scenarios explicitly. The δ rules encode the intent of each scenario, and the single load-balancing objective executes the right migrations automatically. Adding a new scenario only requires adding a new δ rule — no changes to the core scheduling loop.

6. System Architecture

6.1 Components Overview

graph TB
    subgraph "Client Layer"
        C1[Client A]
        C2[Client B]
        C3[Client C]
    end

    subgraph "Gateway Layer"
        GW[Gateway<br/>- Tokenizer<br/>- Routing Protocol<br/>- Traffic Splitting]
    end

    subgraph "Scheduling Layer"
        LS[LlumSched<br/>- Initial Dispatch<br/>- Continuous Rescheduler<br/>- Virtual Usage Policy]
        CMS[Cluster Meta Store<br/>- Real-time instance status<br/>- Per-instance KV usage]
    end

    subgraph "Instance Layer"
        LL0[Llumlet 0<br/>- Status reporter<br/>- Migration coordinator]
        LL1[Llumlet 1]
        LL2[Llumlet 2]
        E0[Engine 0<br/>vLLM]
        E1[Engine 1<br/>vLLM]
        E2[Engine 2<br/>vLLM]
    end

    C1 --> GW
    C2 --> GW
    C3 --> GW
    GW --> LS
    LS <--> CMS
    LS --> LL0
    LS --> LL1
    LS --> LL2
    LL0 --> E0
    LL1 --> E1
    LL2 --> E2
    LL0 <--> CMS
    LL1 <--> CMS
    LL2 <--> CMS
    E0 <-.->|KV migration| E1
    E1 <-.->|KV migration| E2

Figure 7 — Llumnix system architecture. The system has four layers: client, gateway, scheduling, and instance.

6.2 LlumSched

LlumSched is the centralized scheduler with two roles:

Scheduler (initial dispatch): When a new request arrives via the Gateway, LlumSched selects the best target instance using the current virtual usage state. It tokenizes the prompt (via the Gateway) to compute the input length, estimates the KV cache demand, and picks the instance with the lowest virtual usage that can accommodate the prefill.
Rescheduler (continuous migration): LlumSched runs a continuous monitoring loop, triggered by load change events pushed by Llumlets. Whenever an event fires (a request completes, a new request is admitted, KV usage crosses a threshold), LlumSched runs the virtual usage scheduler (Algorithm 2) and potentially enqueues a migration.

Scalability: A single LlumSched actor can handle a cluster of tens to hundreds of instances. The scheduling decisions are O(N) in the number of instances, and the migration operations are asynchronous — LlumSched enqueues them and continues. The paper evaluates on a 16-GPU cluster; the architecture is designed to scale to 100s of instances.

6.3 Llumlet

Each model instance runs a Llumlet sidecar process. Llumlet is tightly coupled with the vLLM engine:

After each decode step (or whenever memory usage changes), it pushes an updated status snapshot to the Cluster Meta Store
When a migration is requested, it coordinates the KV cache transfer (source: serialize KV blocks, destination: deserialize and install them)
It instruments vLLM with hooks to support the suspend/resume operations needed for migration

Full mode vs. Lite mode: In full (white-box) mode, Llumlet has direct access to the engine’s internal state — KV cache block tables, request queues, decode step boundaries. In lite (black-box) mode, it only observes the engine’s external API metrics (queue length, throughput). Full mode achieves better scheduling quality; lite mode works with unmodified engines.

6.4 Cluster Meta Store (CMS)

The CMS is a distributed key-value store (implemented as a Ray actor in the original paper) that maintains per-instance state:

instance_id → {kv_used_bytes, kv_free_bytes, running_requests, pending_requests, utilization}

Each Llumlet writes to CMS after any load-change event. LlumSched reads from CMS before each scheduling decision. The CMS update is triggered event-driven (not periodic polling), ensuring the scheduler always has fresh state.

6.5 Gateway

The Gateway is the front-end for all client traffic. Beyond routing, it provides:

Tokenization: computes exact token counts from raw text, needed for KV cache size estimation
PD disaggregation routing: supports different prefill/decode disaggregation protocols (can route prefill to one instance, decode to another)
Traffic splitting/mirroring: shadow testing, canary deployments

7. Implementation Details

7.1 Building on vLLM and Ray

The original Llumnix is built on top of vLLM as the inference engine and Ray as the distributed runtime:

Each vLLM instance is wrapped by a Ray actor
LlumSched and Llumlets are also Ray actors
The CMS is a Ray actor accessible to all components
KV cache migration uses NCCL for intra-node transfers (NVLink) and Gloo/RDMA for inter-node transfers

This architecture makes Llumnix easy to deploy: anyone already running vLLM can add Llumnix with minimal code changes.

7.2 Migration Bandwidth Control

sequenceDiagram
    participant LS as LlumSched
    participant LL_S as Llumlet_src
    participant E_S as Engine_src (vLLM)
    participant E_D as Engine_dst (vLLM)
    participant LL_D as Llumlet_dst

    LS->>LL_S: migrate(request_r, dst=instance_D)
    LL_S->>E_S: start_pre_copy(r, dst=E_D)
    E_S-->>E_D: transfer KV blocks (background, rate-limited)
    E_S->>E_S: continue decode steps for r
    E_S->>LL_S: pre_copy_complete()
    LL_S->>E_S: suspend_request(r)
    E_S-->>E_D: transfer delta blocks (small)
    LL_S->>LL_D: request_transferred(r)
    LL_D->>E_D: resume_request(r)
    E_D->>E_D: continue decode for r
    LL_S->>CMS: update_status()
    LL_D->>CMS: update_status()

Figure 8 — Migration sequence diagram. The dashed lines represent asynchronous KV cache transfers. The critical path for request downtime is only the delta transfer + suspend/resume.

7.3 Handling Migration Failures

If a migration fails (network error, destination instance OOM), Llumnix falls back gracefully:

The migration is aborted; the request continues on the source
Llumlet reports the failure to LlumSched
LlumSched marks the destination as temporarily unavailable and retries with a different target

8. Experiments and Results

8.1 Experimental Setup

Cluster:       16 A100 40GB GPUs, 2 nodes (8 GPUs each)
               Inter-node: 25 Gbps Ethernet
               Intra-node: NVLink
Model:         LLaMA-7B (1 GPU/instance) and LLaMA-30B (4 GPUs/instance)
Workload:      Trace based on ShareGPT and synthetic Poisson arrivals
               Input lengths: power-law distribution, mean=256 tokens
               Output lengths: power-law distribution, mean=256 tokens
Baselines:     INFaaS (round-robin-load-aware), DeepSpeed-MII, random dispatch
Metrics:       P50, P99 TTFT; P50, P99 per-token decode latency (per-iteration)

8.2 Tail Latency Results

Figure 9 — P99 Tail Latency Comparison (from Table 1 in the paper):

                TTFT (P99)        Per-Token Decode (P99)
                ──────────────    ───────────────────────
INFaaS          120s              485ms
DeepSpeed-MII   95s               420ms
Llumnix         8s  (15× ↓)      240ms (2× ↓)

Key observations:

The TTFT improvement is dramatically larger than the per-token improvement. This makes sense: TTFT is dominated by queuing delays (fragmentation victims must wait for a suitable instance), and Llumnix directly solves fragmentation via de-fragmentation migrations. Per-token latency is hurt mainly by preemptions and interference, which Llumnix reduces via load balancing.
The P50 latency improvements are more modest (1.2–1.5×) — the median case is already acceptable for baselines. Llumnix’s biggest wins are at the tail.

8.3 Priority Differentiation

For the priority experiment, the workload is mixed 50/50 high-priority (online chatbot, tight TTFT SLO) and low-priority (batch evaluation, relaxed SLO).

                      High-Priority TTFT (P99)    Low-Priority TTFT (P99)
Without Llumnix:      120s                        130s  (both suffer equally)
With Llumnix:          80s (1.5× faster)          180s  (degraded — intentional)

The high-priority requests are accelerated 1.5× at the cost of degrading low-priority tail latency. This is the intended behavior for priority-differentiated serving. Without migration, there is no mechanism to implement this tradeoff gracefully.

8.4 Cost Savings

Llumnix is run at increasing GPU counts to find the minimum fleet size that achieves comparable P99 tail latency to the baseline at the full 16-GPU fleet:

Target: Baseline P99 TTFT = 120s at 16 GPUs
Llumnix achieves same P99 TTFT = 120s at only 11 GPUs → 36% fewer GPUs

The savings come from higher utilization: Llumnix avoids wasted capacity from memory fragmentation and preemption-driven over-provisioning. The cluster effectively serves the same SLO with 5 fewer GPUs.

8.5 Migration Overhead

For a LLaMA-7B request with 4096-token sequence (3.3 GB KV cache) migrating over NVLink (600 GB/s effective bandwidth):

Pre-copy phase: ~5.5ms (background, overlapped with 3–4 decode steps)
Delta transfer: ~0.5ms (typically 1 block generated during pre-copy)
Suspend + restart: ~1ms overhead

Total request downtime: ~1.5ms, vs. 5.5ms naive NVLink transfer. The downtime is roughly constant across sequence lengths — a 100-token request and a 4000-token request both experience ~1.5ms downtime because the delta is always small.

graph LR
    A["vLLM<br/>(2023)"] -->|adds| B["PagedAttention<br/>within instance"]
    C["Orca/TGI"] -->|adds| D["Continuous batching<br/>within instance"]
    E["DistServe<br/>(2024)"] -->|adds| F["Prefill-Decode<br/>disaggregation"]
    G["Mooncake<br/>(2024)"] -->|adds| H["KV cache-centric<br/>disaggregation"]
    I["Llumnix<br/>(2024)"] -->|adds| J["Cross-instance<br/>request migration<br/>(runtime rescheduling)"]

    B --> K[Better single-instance<br/>memory efficiency]
    D --> K
    F --> L[Better prefill/decode<br/>resource allocation]
    H --> L
    J --> M[Better multi-instance<br/>SLO differentiation,<br/>load balancing,<br/>de-fragmentation]

Figure 10 — Positioning of Llumnix relative to related systems. Existing work improves efficiency within a single instance (vLLM, ORCA) or at the prefill/decode split boundary (DistServe, Mooncake). Llumnix operates at a different layer: cross-instance dynamic rescheduling, complementary to all existing single-instance improvements.

Llumnix is not a replacement for vLLM or DistServe. It is a scheduling layer on top of these engines. You can run DistServe’s disaggregated prefill/decode within each “instance” and still benefit from Llumnix’s cross-instance migration for load balancing and priority handling.

10. Critical Assessment: Weaknesses & Improvements

10.1 Weaknesses and Flaws

W1 — Single backend only (vLLM). The evaluation uses vLLM exclusively. SGLang, TGI, DeepSpeed-MII, and TensorRT-LLM have different memory management systems; the migration mechanism requires engine-specific hooks (KV block serialization, suspend/resume). The paper claims the architecture is engine-agnostic but provides no evidence for any engine beyond vLLM. Given that production deployments increasingly use specialized engines (e.g., TensorRT-LLM for NVIDIA NIM), this is a significant gap.

W2 — Synthetic and narrow workload. All experiments use Poisson arrivals with power-law length distributions derived from ShareGPT. Real production LLM workloads have: diurnal traffic patterns (burst hours vs. quiet hours), heavy-tailed request correlations (many users asking similar questions → prefix caching hits), and diverse SLO mixes. The paper shows no results under bursty arrivals, hot-prefix workloads, or multi-model serving.

W3 — No prefix cache interaction. vLLM and SGLang both implement prefix KV cache sharing: if two requests share a common prefix (e.g., the same system prompt), they can share KV cache blocks. Migration may violate this sharing — migrating a request away from an instance that holds its shared prefix blocks would require either also migrating the shared blocks (complex) or recomputing them (expensive). The paper does not address this interaction at all.

W4 — Migration cost model is simplified. The paper reports migration overhead as ~1.5ms for LLaMA-7B with 4096 tokens over NVLink. But:

It does not evaluate cross-node migrations (PCIe/RDMA), where transfer bandwidth is 20–50× lower than NVLink
It does not evaluate long-context requests (e.g., 32K or 128K tokens), where even the delta transfer becomes large
It does not show what happens when many migrations fire simultaneously (network saturation, head-of-line blocking)

W5 — INFaaS as the sole baseline. INFaaS (2021) is a fairly old and general model serving system, not an LLM-specific one. The paper does not compare against more recent LLM-specific multi-instance schedulers that have appeared since 2023 (e.g., FairServe, SpotServe, or production systems like Azure’s LLM infrastructure).

10.2 Limitations the Authors Understate or Omit

L1 — Ray overhead in LlumSched. LlumSched is a single centralized Ray actor. At high request rates (>1000 req/s) or with very large clusters (>100 instances), the actor becomes a bottleneck. The paper evaluates only 16 instances with moderate request rates. The scalability section is brief (just a micro-benchmark showing scheduling latency vs. cluster size) and does not show end-to-end throughput degradation at scale.

L2 — CMS consistency. The CMS is updated event-driven by Llumlets, but LlumSched reads from it before each decision. In a fast-changing cluster (many short requests completing rapidly), the CMS may be stale by the time LlumSched acts on it. The paper mentions this (calling it “information lag”) but does not quantify how often stale information leads to suboptimal or incorrect scheduling decisions.

L3 — Priority inversion risk. The priority mechanism works by migrating low-priority requests away from high-priority ones. But if the destination for low-priority requests is already loaded, they may get preempted there — and the preemption overhead (recompute) could be worse than not migrating at all. The paper does not analyze this corner case.

L4 — Auto-scaling integration is not fully evaluated. The scale-out and scale-in scenarios are described and shown in a micro-benchmark but are not evaluated in the main end-to-end experiments. Real auto-scaling involves cold-start latency for new instances (loading model weights, ~30s for LLaMA-7B), which the paper does not account for.

10.3 Concrete Improvement Suggestions

I1 — Add SGLang and TensorRT-LLM backends. Demonstrate that the migration mechanism generalizes beyond vLLM. This requires designing an engine-agnostic migration API and testing at least one non-vLLM engine in the evaluation.

I2 — Integrate with prefix caching. Design a migration policy that is prefix-cache-aware: migrations should prefer moving requests to instances that already hold relevant cached prefixes. When migration would break a cache hit, the scheduler should weigh the migration benefit against the re-prefill cost.

I3 — Evaluate cross-node migrations explicitly. Add an experiment with slow inter-node links (10Gbps, 25Gbps Ethernet) to characterize when cross-node migration is beneficial vs. harmful. Derive a simple threshold rule: “migrate cross-node only if the expected benefit (reduced queuing delay) exceeds $\alpha \times T_{\text{cross-node-transfer}}$ ”.

I4 — Test with bursty and diurnal workloads. Use real production traces (e.g., Azure LLM serving traces if available, or BurstGPT synthetic traces) to characterize Llumnix’s behavior under non-Poisson arrivals. The current evaluation may overstate benefits by assuming steady-state Poisson — real bursts may cause LlumSched to trigger excessive migrations simultaneously, creating a migration storm.

I5 — Distributed LlumSched. As the system scales beyond 50 instances, the centralized scheduler may become a bottleneck. Consider a hierarchical design: a global coordinator + per-rack local schedulers, with migrations handled locally within a rack (cheap NVLink) and globally handled only for cross-rack rebalancing.

11. Reproducibility Notes

Code: https://github.com/llumnix-project/llumnix (open source, Apache 2.0)
Requires: vLLM >= 0.2.x, Ray >= 2.6, CUDA-capable GPUs
The original v0 (paper version) uses Ray-based architecture: https://github.com/llumnix-project/llumnix-ray
Workload generation: the paper uses ShareGPT traces; these are publicly available
Hardware: 16 A100 GPUs are needed to exactly replicate the main results; smaller-scale experiments (8 GPUs) should still show the trends

12. Deeper Analysis: Why Virtual Usage Works — A Formal Perspective

12.1 The Scheduling Objective as a Potential Function

To understand why the virtual usage formulation is principled and not just a heuristic, consider the following framing. Define the cluster “imbalance potential” at time $t$ as:

\Phi(t) = \sum_{i} \left( V_i^{\text{total}}(t) \right)^2

where $V_i^{\text{total}}$ is the total virtual usage on instance $i$ . This is the sum of squared loads — a classical measure used in load balancing theory. A migration from instance $src$ (highest $V$ ) to instance $dst$ (lowest $V$ ) reduces $\Phi$ :

\Delta\Phi = \left(V_{src} - v_r\right)^2 + \left(V_{dst} + v_r\right)^2 - V_{src}^2 - V_{dst}^2

= -2 v_r (V_{src} - V_{dst}) + 2 v_r^2

For $V_{src} - V_{dst} > v_r$ (i.e., the imbalance exceeds the virtual usage of the migrated request), $\Delta\Phi < 0$ — the migration strictly reduces imbalance.

This is exactly the condition checked in Algorithm 2 (line 5): the migration threshold is precisely calibrated to ensure $V_{src} - V_{dst}$ is large enough that moving any reasonable request ( $v_r \ll V_{src} - V_{dst}$ ) reduces $\Phi$ . The scheduler is performing a greedy descent on the squared imbalance potential, which converges in $O(N \log N)$ migration steps to a state where no single migration can reduce $\Phi$ further.

Why virtual (not real) usage? The insight is that the potential function should reflect the desired load distribution, not the actual one. For de-fragmentation, the desired state is one where a target instance has a large contiguous free block — we inflate the pending request’s virtual demand on that instance to communicate this desire to the scheduler without adding any hard constraints.

12.2 Formal Downtime Bound for Pre-Copy Migration

Let us derive the downtime bound more carefully. Define:

$b_s$ = block size (bytes per KV cache page)
$B_{\text{pre}}$ = pre-copy bandwidth (bytes/sec)
$\tau_d$ = duration of one decode step (seconds)
$n_0$ = initial sequence length when migration starts
$k$ = number of decode steps during pre-copy phase

The pre-copy phase transfers $n_0$ blocks. Time to complete:

T_{\text{pre-copy}} = \frac{n_0 \cdot b_s}{B_{\text{pre}}}

During this time, the request generates $k$ new blocks:

k = \left\lfloor \frac{T_{\text{pre-copy}}}{\tau_d} \right\rfloor = \left\lfloor \frac{n_0 \cdot b_s}{B_{\text{pre}} \cdot \tau_d} \right\rfloor

The stop phase transfers only $k$ blocks:

T_{\text{stop}} = \frac{k \cdot b_s}{B_{\text{stop}}} \approx \frac{k \cdot b_s}{B_{\text{pre}}}

Substituting:

T_{\text{stop}} \approx \frac{b_s^2 \cdot n_0}{B_{\text{pre}}^2 \cdot \tau_d}

This appears to scale with $n_0$ , but in practice $n_0 \cdot b_s / (B_{\text{pre}} \cdot \tau_d) = k$ is a small integer because the pre-copy bandwidth is chosen to be much faster than the decode rate. Concretely:

For LLaMA-7B: $\tau_d \approx 10$ ms/step, $b_s \approx 0.5$ MB/block (16 heads × 128 dim × 2 bytes × 40 layers × block_size tokens), $B_{\text{pre}} = 100$ GB/s (NVLink with 20% cap). Even for $n_0 = 4096$ tokens (256 blocks), $T_{\text{pre-copy}} = 256 \times 0.5\text{MB} / 100\text{GB/s} = 1.28$ ms, so $k = 1$ block new during pre-copy. Stop phase: transfer 1 block = 5μs. Total downtime ≈ 1ms, independent of $n_0$ for reasonable bandwidth settings.

The key design parameter: $B_{\text{pre}} \gg b_s / \tau_d$ (pre-copy bandwidth must substantially exceed the rate at which new blocks are generated). If this inequality is satisfied, $k$ stays small and the downtime bound is truly constant.

13. Practical Deployment Notes

13.1 When to Enable Llumnix vs. When Not To

Scenarios where Llumnix provides clear benefit:

High utilization (>60% average KV memory): preemptions become frequent, and Llumnix’s load balancing and defragmentation directly reduce them
Mixed-length workloads: standard deviation of request length > mean → fragmentation is severe
Multi-tenant serving with SLO tiers: priority differentiation requires runtime isolation
Elastic deployments (cloud auto-scaling): Llumnix’s scale-out migration dramatically speeds up warm-up of new instances

Scenarios where Llumnix adds overhead without benefit:

Very low utilization (<30%): no preemptions, no fragmentation, migration overhead only adds noise
Homogeneous requests (all same length): no fragmentation, load stays balanced without migration
Single-instance deployment: no other instances to migrate to

13.2 Configuration Guidelines (From the Paper)

The migration threshold (Algorithm 2, line 5) is the most important tuning parameter:

Too low: excessive migrations, bandwidth contention, increased TBT variance
Too high: migrations don’t fire frequently enough, fragmentation persists

The paper recommends starting with threshold = mean request virtual usage × number of instances (a request-level rather than byte-level threshold), then adjusting based on observed migration rate vs. SLO violations.

13.3 Layered Mental Model for LLM Serving Infrastructure

After reading Llumnix, I find it helpful to think about LLM serving optimizations as forming three orthogonal layers — each layer addressing a different scope of resource management:

Layer 1: Within a single instance
  ┌─────────────────────────────────────────────────────────────┐
  │  vLLM (PagedAttention)    → Eliminate internal KV fragm.   │
  │  ORCA (Continuous batch)  → Maximize GPU decode occupancy  │
  │  FlashAttention            → Speed up attention compute     │
  └─────────────────────────────────────────────────────────────┘
         ↓ does not help with cross-instance problems

Layer 2: Between prefill and decode roles
  ┌─────────────────────────────────────────────────────────────┐
  │  DistServe (P-D disagg.)  → Separate compute per phase     │
  │  Mooncake (KV-centric)    → Cache-first KV management      │
  │  Sarathi-Serve (chunked)  → Control prefill interference   │
  └─────────────────────────────────────────────────────────────┘
         ↓ does not help with cross-instance SLO/load issues

Layer 3: Across instances at runtime (Llumnix's layer)
  ┌─────────────────────────────────────────────────────────────┐
  │  Llumnix                  → Dynamic rescheduling,          │
  │                              defragmentation, priorities,  │
  │                              auto-scaling migration        │
  └─────────────────────────────────────────────────────────────┘

These three layers are compositional: a production deployment can (and should) stack all three. Llumnix’s scheduling layer sits on top of whichever single-instance engine and role-level disaggregation the operator has already deployed.

14. Conclusion

Llumnix is a significant contribution to LLM serving infrastructure. The core insight — that LLM requests are analogous to OS processes and should be rescheduled at runtime, not just dispatched once — is both simple and powerful. The technical execution is solid: the pre-copy migration mechanism achieves near-zero downtime (provably constant with respect to sequence length when bandwidth > decode rate), and the virtual usage abstraction elegantly unifies five scheduling scenarios under one policy.

The work is most impactful in production multi-tenant serving environments where: (1) request lengths vary widely (creating fragmentation), (2) memory loads are moderate-to-high (making preemptions likely), and (3) requests have differentiated SLOs (requiring priority isolation). These conditions describe most real-world LLM API services.

The limitations are real but not disqualifying: the evaluation scope is narrow (vLLM only, Poisson workloads), and the interaction with modern features like prefix caching and long-context requests is unexplored. Future work integrating Llumnix with prefix-cache-aware dispatch and cross-engine migration would make it production-ready for a broader set of deployments.

For systems researchers, Llumnix is an exemplary paper in the tradition of applying classical OS concepts to new hardware/workload realities — a tradition that has produced vLLM (OS virtual memory → paged KV cache), DistServe (disaggregation), and now Llumnix (process scheduling → request rescheduling). The virtual usage formulation in particular deserves attention as a general technique: encoding multi-objective scheduling goals as a single scalar penalty that a standard load-balancing objective then minimizes.

Tags: LLM Serving, LLM Inference, KV Cache, Operating Systems