SlidingServe: SLO-Aware Sliding-Window Scheduling for LLM Inference

Review date: 2026-06-07 Review author: Zhongzhu Zhou Paper reviewed: Beyond Greedy Chunking: SLO-Aware Sliding-Window Scheduling for LLM Inference Paper authors: Yuansheng Chen, Yue Zhang, Xuan Mo, Weigang Wu, Jialun Li arXiv: 2606.05933v1, 2026-06-05 Venue/status: arXiv preprint, 2026

Short Answer

SlidingServe is a production-oriented LLM inference scheduler that tackles a fundamental tension in online serving: maximizing throughput while ensuring that individual requests meet their per-token latency contracts (SLOs). Most prior work either ignores SLOs entirely or enforces them with coarse-grained heuristics that fail under heterogeneous or heavy load. SlidingServe’s key insight is that single-step greedy scheduling — the dominant approach — is locally optimal but globally suboptimal: maximizing the chunk size in the current iteration consumes slack that would have allowed a larger, more efficient batch in the next iteration.

The system is organized around four tightly integrated modules: (1) a Batch Latency Predictor that estimates iteration execution time from a 7-dimensional, theory-driven feature vector using per-scenario linear models; (2) a Multi-Level Priority Sorter (MLPS) that ranks prefill and waiting requests via a lexicographic key encoding SLO protection status, urgency, and remaining workload; (3) SlidingChunker, which jointly optimizes chunk allocation across two consecutive iterations via discrete ternary search, replacing single-step maximization with a two-step sliding-window objective; and (4) BatchConstructor, which uses a 0/1 knapsack DP formulation to select which requests actually execute when the current batch is predicted to cause TTFT violations.

Empirically the system delivers convincing numbers: 9.7–30% higher goodput than QoServe (the self-declared SOTA), 16–53% lower SLO violation rate under heavy load, and a predictor with R² > 0.99 at MAE ≤ 2.72 ms — all evaluated on real hardware (RTX 3090, TP2) with standard public datasets. The critical limitation the paper underplays is scope: the evaluation is confined to a single-node, two-GPU TP2 setup with at most two models, and the DP solver’s worst-case complexity is not bounded in the paper, making it hard to assess overhead at scale.

1. Prerequisites

1.1 LLM Inference Lifecycle: Prefill, Decode, TTFT, and TBT

A transformer-based LLM generates text autoregressively. Every forward pass produces exactly one new token per sequence in the batch. The lifecycle of a single serving request consists of two distinct phases:

Prefill phase. The entire prompt (input tokens) is processed in one or a few forward passes to populate the Key–Value (KV) cache. Prefill is compute-bound: attention complexity scales as O(PS)O(P \cdot S) where PP is the prompt length and SS is the sequence length so far. For long prompts this can dominate the GPU for tens or hundreds of milliseconds.

Decode phase. After prefill, the model generates output tokens one by one. Each decode step processes just one new token per sequence while attending over the full KV cache. Decode is memory-bandwidth-bound rather than compute-bound, and each step typically takes a few milliseconds.

The key latency metrics users perceive are:

  • TTFT (Time to First Token): elapsed time from request arrival to the first output token. Sensitive to how long the request waits for prefill to complete. Acceptable range is typically 1–5 seconds for interactive applications.
  • TBT (Time Between Tokens, also called ITL — Inter-Token Latency): the time gap between successive output tokens during decode. Perceptible stuttering occurs when TBT exceeds ~100 ms (streaming threshold). SlidingServe models TBT as a sequence of per-token deadlines.

SlidingServe formalizes these as SLO constraints. For request ii with arrival time aia_i, TTFT SLO LittftL_i^{\text{ttft}}, TBT SLO LitbtL_i^{\text{tbt}}, and kk tokens already generated, the deadline for the (k+1)(k+1)-th token is:

di,k+1=ai+Littft+kLitbt(1)d_{i,k+1} = a_i + L_i^{\text{ttft}} + k \cdot L_i^{\text{tbt}} \tag{1}

This unified formulation treats both TTFT and TBT as a single timeline of deadlines, enabling a consistent scheduling objective across the prefill and decode phases.

1.2 KV Cache Memory Dynamics and PagedAttention

During decode, every request retains its KV cache — the intermediate attention key/value tensors for all previously generated tokens. KV cache memory grows linearly with sequence length: for a model with LL layers, hidden dimension dd, and a sequence of length SS, the KV cache size is approximately 2LdSsizeof(dtype)2 \cdot L \cdot d \cdot S \cdot \text{sizeof(dtype)} bytes. For a 7B-parameter model at bfloat16, this is roughly 2324096S22 \cdot 32 \cdot 4096 \cdot S \cdot 2 bytes ≈ 524 KB per token of sequence length.

Classic serving systems (prior to vLLM) pre-allocated a contiguous block of GPU memory for each request, causing severe internal fragmentation — unused space within reserved blocks — that could waste 30–40% of GPU memory. PagedAttention (vLLM, Kwon et al. 2023) virtualizes KV cache into fixed-size “pages” akin to OS virtual memory, eliminating fragmentation and enabling fine-grained memory sharing across requests with common prefixes. This allows many more requests to be co-located in a batch, dramatically improving GPU utilization.

1.3 Chunked Prefill and Why It Matters for SLO

In standard (non-chunked) serving, a long-prompt prefill executes as a single uninterrupted GPU kernel that may take 200–500 ms. During this time, no decode tokens are produced, causing TBT spikes for all co-batched decode requests.

Chunked prefill (Sarathi-Serve, Agrawal et al. 2023) decomposes long prefill into smaller “chunks” of CC tokens each, interleaving them with decode steps. This has two effects:

  1. Decode TBT is bounded to roughly Tchunk+Tdecode_only\approx T_{\text{chunk}} + T_{\text{decode\_only}} rather than Tfull_prefillT_{\text{full\_prefill}}.
  2. The scheduler gains a new degree of freedom — the chunk size CC — which becomes the primary knob for the throughput/latency tradeoff.

However, chunked prefill introduces a non-trivial scheduling problem: the optimal CC depends on the current mix of running decode requests, their remaining slack, the length of waiting prefill requests, and the non-linear relationship between batch composition and GPU execution time. This is the core challenge SlidingServe addresses.

1.4 SLO, Goodput, and SLO Attainment Rate

Service Level Objective (SLO): a contractual latency bound that the system commits to meet for each request. In LLM serving, this is typically expressed as a TTFT bound and a TBT bound.

SLO attainment rate: the fraction of requests for which all per-token deadlines are met. SlidingServe measures goodput as the request rate that keeps SLO attainment ≥ 99% (i.e., at most 1% of requests violate their SLO).

Goodput = throughput × SLO attainment rate. Higher throughput but poor SLO attainment yields low goodput. SlidingServe optimizes goodput rather than raw throughput.

The system latency budget for each iteration can be expressed as a constraint:

TiterminiDsafesi(t)(2)T_{\text{iter}} \leq \min_{i \in \mathcal{D}_{\text{safe}}} s_i(t) \tag{2}

where si(t)=di,kts_i(t) = d_{i,k} - t is the remaining slack for decode request ii at scheduling time tt, and Dsafe\mathcal{D}_{\text{safe}} is the set of decode requests whose TBT deadline is within the current window.

1.5 Continuous Batching

Continuous batching (Orca, Yu et al. 2022) allows the system to add new requests to and remove completed requests from the active batch at every iteration boundary, rather than waiting for the entire batch to finish. This eliminates the head-of-line blocking common in static batching where a short request must wait for all long requests to complete. Combined with chunked prefill, continuous batching is the foundation of modern high-throughput LLM serving.

2. Problem Statement & Motivation

2.1 The Failure Modes of Greedy Chunking

The dominant scheduling strategy for chunked prefill is what the paper calls single-step scheduling: at each iteration, compute the maximum allowed execution time from the TBT slack of the most urgent decode request, then select the largest chunk size that does not exceed this bound.

Cgreedy=TimeToBudget ⁣(miniDsi(t))(3)C_{\text{greedy}} = \text{TimeToBudget}\!\left(\min_{i \in \mathcal{D}} s_i(t)\right) \tag{3}

This is intuitive — never violate the current iteration’s deadline — but it is locally optimal, not globally optimal. The paper identifies two structural failure modes:

Failure Mode 1: Slot starvation across iterations. Because GPU execution time is a non-linear, concave-ish function of batch size (compute-bound prefill transitions to memory-bound decode), using the maximum legal chunk in iteration tt often leaves a very small remaining slack for iteration t+1t+1. The next iteration is forced to use a tiny chunk size, possibly degenerating to a pure-decode batch with poor prefill throughput. The cumulative cost over many iterations exceeds what would have been incurred by slightly reducing the chunk in iteration tt.

Failure Mode 2: Batch-level TTFT blindness. Scheduling works in two stages: first decide chunk size, then fill the batch with sorted requests. But a batch containing multiple prefill requests executes atomically — every request in the batch waits for the full batch to complete. If the batch’s predicted execution time exceeds the TTFT slack of some requests, those requests will violate even though they were “included” in the batch. Standard chunked-prefill schedulers ignore this intra-batch interference.

graph TD
    A[Request arrives\nTTFT deadline = 100ms] --> B{Batch construction}
    B -->|Greedy: max chunk| C[Batch with 4 prefill\nexec_time = 120ms]
    C --> D[TTFT violation!\nr1, r2, r3 miss deadline]
    B -->|SlidingServe| E[BatchConstructor\nremoves r2, r4]
    E --> F[Batch with r1, r3\nexec_time = 80ms]
    F --> G[r1, r3 meet TTFT\nr2, r4 deferred to next iter]
    G --> H[r2, r4 served next iter\nwithin their slack]

Figure 1: The intra-batch TTFT blindness problem. Greedy fills the batch and causes violations; BatchConstructor selectively removes requests to stay within the tightest TTFT deadline.

2.2 The Multi-dimensional Scheduling Problem

The full scheduling problem in chunked-prefill LLM serving combines three interacting sub-problems:

  1. Cross-iteration budget allocation: How should the total token budget be split between iterations tt and t+1t+1?
  2. Request prioritization: Among all waiting and prefilling requests, which should receive budget first?
  3. Intra-batch request selection: Among requests eligible for the current batch, which subset minimizes SLO violations given the predicted execution time?

These three problems interact: the chosen chunk size affects which requests fit, the ordering of requests affects which requests get the chunk, and the subset selected affects the batch’s execution time which in turn affects TTFT deadlines. SlidingServe is the first system to address all three jointly.

2.3 The Cost of Ignoring Cross-Iteration Dependencies

To make the problem concrete, consider a server running Qwen2.5-7B on TP2. Suppose decode requests impose a TBT SLO of 50 ms and the system has 8 active decode requests with minimum slack 40 ms. A greedy scheduler computes Cgreedy=TimeToBudget(40)=1800C_{\text{greedy}} = \text{TimeToBudget}(40) = 1800 tokens and fills the batch with 1800 prefill tokens. The batch executes in 38 ms, consuming 38/40 = 95% of the available slack.

In the next iteration, the minimum decode slack is 5038=1250 - 38 = 12 ms (the used slack reduces the next-window budget). Cgreedy,next=TimeToBudget(12)=350C_{\text{greedy,next}} = \text{TimeToBudget}(12) = 350 tokens — a massive reduction. The system oscillates between large and tiny batches, averaging much lower throughput than if it had used 1200 + 700 = 1900 tokens with a smoother allocation.

SlidingChunker finds B=1100B^* = 1100 for this iteration and leaves BΣB=800B_\Sigma - B^* = 800 for the next, resulting in two iterations of similar size and total throughput of 1900 tokens — the same total but without the waste from the oscillation pattern.

3. SlidingServe Architecture Overview

SlidingServe operates as a per-iteration decision loop layered on top of vLLM. The system maintains three queues:

  • Waiting Queue: newly arrived requests not yet started on prefill
  • Prefilling Queue: requests actively undergoing chunked prefill
  • Decoding Queue: requests that have completed prefill and are generating tokens

At each scheduling round, the system executes the following pipeline:

graph LR
    WQ[Waiting Queue] --> PS[Priority Sorter\nMLPS]
    PQ[Prefilling Queue] --> PS
    DQ[Decoding Queue] --> BLP[Batch Latency\nPredictor]
    PS --> CB[Candidate Batch\nmax fill]
    CB --> VC[Violation Checker]
    BLP --> VC
    VC -->|No violation risk| SC[SlidingChunker\nbranch]
    VC -->|Violation risk| BC[BatchConstructor\nbranch]
    SC --> EB[Executable Batch]
    BC --> EB
    EB --> GPU[GPU Execution]
    GPU --> UQ[Update Queues\nnext iteration]
    UQ --> WQ
    UQ --> PQ
    UQ --> DQ

Figure 2: SlidingServe architecture. The system forks into SlidingChunker (when no violations are predicted) or BatchConstructor (when violations are detected). The predictor feeds both the Violation Checker and the chunk optimizer.

The Violation Checker is the branch gate: it runs the Batch Latency Predictor on the maximally filled candidate batch and checks whether any prefill request’s TTFT deadline would be missed. If no violations are predicted, the system uses SlidingChunker to optimize the chunk size. If violations are detected, the system switches to BatchConstructor to surgically remove requests from the batch.

4. Core Technical Components

4.1 Batch Latency Predictor

The predictor is a lightweight linear model that estimates iteration wall-clock time T^\hat{T} given the current candidate batch. It is lightweight by design — it must run multiple times per scheduling round (for the binary search in SlidingChunker and the DP in BatchConstructor) without becoming a bottleneck.

Input representation. For a batch B={(ci,ui)}i=1n\mathcal{B} = \{(c_i, u_i)\}_{i=1}^n where cic_i is the token count allocated to request ii and uiu_i is the number of cached tokens (KV cache length):

D={ici1},P={ici>1}(4)\mathcal{D} = \{i \mid c_i \leq 1\}, \quad \mathcal{P} = \{i \mid c_i > 1\} \tag{4}

Three batch scenarios are classified:

s(B)={pure decode,P=0pure prefill,D=0mixed,otherwise(5)s(\mathcal{B}) = \begin{cases} \text{pure decode}, & |\mathcal{P}| = 0 \\ \text{pure prefill}, & |\mathcal{D}| = 0 \\ \text{mixed}, & \text{otherwise} \end{cases} \tag{5}

7-dimensional feature vector. The predictor uses explicit feature engineering, decomposing batch latency into theory-motivated factors:

FeatureDefinitionSemantic meaning
x1x_1iPci(ui+ci)\sum_{i \in P} c_i(u_i + c_i)Prefill attention complexity (cross-attention term)
x2x_2iPci2\sum_{i \in P} c_i^2Self-attention intensity within prefill chunk
x3x_3i=1nui\sum_{i=1}^n u_iTotal KV cache tokens (global memory pressure)
x4x_4D\lvert\mathcal{D}\rvertNumber of decode requests
x5x_5iDui\sum_{i \in D} u_iTotal context length of decode requests
x6x_6iPci\sum_{i \in P} c_iTotal prefill tokens scheduled
x7x_7maxiPci\max_{i \in P} c_iLargest single prefill allocation

Table: The 7 features of the Batch Latency Predictor. x1x_1 and x2x_2 capture quadratic attention cost; x3x_3x5x_5 capture decode memory bandwidth pressure; x6x_6x7x_7 capture prefill compute volume.

The predictor for scenario mm is:

T^=bˉ(m)+j=17wˉj(m)xj(6)\hat{T} = \bar{b}^{(m)} + \sum_{j=1}^{7} \bar{w}_j^{(m)} x_j \tag{6}

where bˉ(m)\bar{b}^{(m)} is a per-scenario intercept and wˉj(m)\bar{w}_j^{(m)} are scenario-specific linear coefficients. The system trains separate expert models for each of the three batch scenarios and falls back to a global model when a scenario has insufficient training samples.

Training. The predictor uses offline initialization from profiling data followed by asynchronous online incremental updates. The online calibration runs in a background thread and swaps the model via hot-switching without blocking the inference path. Empirically the predictor achieves MAE ≤ 2.72 ms and R² > 0.99 (Section 5.5).

Why a linear model? The choice of linear regression over a neural network is intentional: (a) it is interpretable, (b) it is extremely fast to evaluate (< 1 μs), and (c) the hand-crafted features already capture the dominant non-linearities (attention is quadratic in sequence length, captured by x1x_1 and x2x_2).

Feature engineering rationale. Let’s examine why each feature was chosen:

  • x1=iPci(ui+ci)x_1 = \sum_{i\in P} c_i(u_i + c_i): In causal self-attention, a chunk of size cic_i must attend to all uiu_i previously cached positions plus the cic_i positions within the chunk itself. The total attention work for this request is proportional to ci(ui+ci)c_i \cdot (u_i + c_i). Summing over all prefill requests gives the total cross-chunk and in-chunk attention FLOPs, making x1x_1 the most important predictor of execution time in compute-bound prefill.

  • x2=iPci2x_2 = \sum_{i\in P} c_i^2: The self-attention within a chunk contributes O(ci2)O(c_i^2) operations independently of cache depth. For large chunks (e.g., ci=512c_i = 512), this term alone contributes \sim262K operations per layer. Since x1=ciui+ci2=ciui+x2x_1 = \sum c_i u_i + \sum c_i^2 = \sum c_i u_i + x_2, retaining both x1x_1 and x2x_2 separately allows the regression to weight the cache-dependent and self-attention terms differently.

  • x3=i=1nuix_3 = \sum_{i=1}^n u_i: Total KV cache tokens across all requests (both prefill and decode) determines the memory footprint that must be loaded from HBM for each forward pass. This feature captures memory-bandwidth pressure, which dominates in large decode batches.

  • x4=Dx_4 = |\mathcal{D}| and x5=iDuix_5 = \sum_{i\in D} u_i: The number of decode requests and their total context length together determine the decode-phase compute and memory access pattern. In a mixed batch, decode requests are constant-overhead in compute (one token each) but variable-overhead in memory (proportional to context length).

  • x6=iPcix_6 = \sum_{i\in P} c_i and x7=maxiPcix_7 = \max_{i\in P} c_i: Total prefill tokens is the primary volume metric. The maximum single-request allocation captures the worst-case memory allocation event in the batch and helps predict spikes caused by a single very large chunk.

The linear model with these features achieves R² > 0.99 because GPU kernel execution time is nearly piecewise-linear in these FLOP/memory-access quantities for a fixed model and hardware configuration. Batch-level latency is dominated by the single longest-running kernel, which in mixed batches is typically the fused attention + MLP kernel for the prefill portion — and that kernel’s time is approximately w1x1+w2x2+fixed overheadw_1 x_1 + w_2 x_2 + \text{fixed overhead}.

4.2 SlidingChunker

SlidingChunker replaces single-step greedy chunk selection with a joint two-iteration optimization. The key insight: by slightly under-utilizing the current iteration’s slack, the next iteration can use a larger chunk, and the combined throughput of both iterations exceeds what single-step maximization achieves.

Step-by-step algorithm:

Algorithm 1: SlidingChunker
Input:  D (decode requests), Q (sorted prefill+wait requests),
        B_max (server max chunk), t (current time),
        F (BatchForwarder with predictor)
Output: B* (optimal chunk size), A* (request-level allocation)

1.  Compute current window deadline:
      T_cur ← min_{i ∈ D_safe} s_i(t)              // tightest TBT slack

2.  Compute next window deadline:
      T_next ← min_{i ∈ D_safe} (s_i(t) - T_cur + L_i^tbt)

3.  Compute total budget B_Σ using binary search:
      B_Σ ← TimeToBudget(T_cur + T_next)

4.  Ternary search over b ∈ [1, B_Σ]:
    Find B* = argmin_b [ T̂(b) + T̂(B_Σ - b) ]
      where T̂(b) = F.predict(batch formed with chunk b)

5.  Construct A* = request-level allocation for chunk B*
    using sorted queue Q (fill up to B* tokens)

6.  Return (B*, A*)

The window constraints are derived from TBT SLO slack:

Tcur=miniDsafesi(t)(7)T_{\text{cur}} = \min_{i \in \mathcal{D}_{\text{safe}}} s_i(t) \tag{7} Tnext=miniDsafe(si(t)Tcur+Litbt)(8)T_{\text{next}} = \min_{i \in \mathcal{D}_{\text{safe}}} \left(s_i(t) - T_{\text{cur}} + L_i^{\text{tbt}}\right) \tag{8}

The optimization objective is:

B=argminb[T^(b)+T^(BΣb)](9)B^* = \arg\min_b \left[\hat{T}(b) + \hat{T}(B_\Sigma - b)\right] \tag{9}
sequenceDiagram
    participant SC as SlidingChunker
    participant BLP as BatchLatencyPredictor
    participant Q as SortedQueue

    SC->>SC: Compute T_cur, T_next from TBT slacks
    SC->>SC: Binary search for B_Σ (total budget)
    loop Ternary search over b ∈ [1, B_Σ]
        SC->>BLP: predict(batch_with_chunk=b)
        BLP-->>SC: T̂(b)
        SC->>BLP: predict(batch_with_chunk=B_Σ-b)
        BLP-->>SC: T̂(B_Σ-b)
        SC->>SC: Update B* if T̂(b)+T̂(B_Σ-b) improves
    end
    SC->>Q: Fill batch up to B* tokens
    Q-->>SC: A* (request-level allocation)

Figure 3: SlidingChunker request flow. The predictor is called O(log B_Σ) times during ternary search. The BatchForwarder wraps the predictor with queue state to simulate full batch construction.

Why ternary search? The objective T^(b)+T^(BΣb)\hat{T}(b) + \hat{T}(B_\Sigma - b) is unimodal in bb when execution time is a monotone convex function of chunk size, which holds empirically for transformer inference. Ternary search finds the optimum in O(logBΣ)O(\log B_\Sigma) predictor calls, typically < 20 iterations.

4.3 Multi-Level Priority Sorter (MLPS)

The MLPS sorts all candidate prefill and waiting requests before batch construction. Decode requests are always included first (they consume exactly 1 token each and are highly latency-sensitive); the remaining budget is allocated to the sorted prefill/wait queue.

Definitions. For request ii at scheduling round tt:

  • Remaining prefill tokens: ri(t)=pici(t)r_i(t) = p_i - c_i(t) where pip_i is prompt length and ci(t)c_i(t) is tokens already computed
  • TTFT slack: si(t)=ai+Littftts_i(t) = a_i + L_i^{\text{ttft}} - t
  • System throughput: ρt\rho_t (recent observed tokens/sec)
  • Estimated time to complete prefill: T^ipre(t)=ri(t)/ρt\hat{T}_i^{\text{pre}}(t) = r_i(t) / \rho_t

Normalized urgency:

ui(t)=T^ipre(t)max(si(t),ϵ)=ri(t)ρtmax(si(t),ϵ)(10)u_i(t) = \frac{\hat{T}_i^{\text{pre}}(t)}{\max(s_i(t), \epsilon)} = \frac{r_i(t)}{\rho_t \cdot \max(s_i(t), \epsilon)} \tag{10}

This ratio measures “estimated prefill time / remaining slack.” When ui(t)>1u_i(t) > 1, the request cannot complete prefill before its TTFT deadline even at full throughput. The urgency flag is:

ei(t)=1[ui(t)>α](11)e_i(t) = \mathbf{1}[u_i(t) > \alpha] \tag{11}

where α\alpha is a configurable threshold (e.g., α=1.0\alpha = 1.0). Additionally, gi{0,1}g_i \in \{0,1\} is a protection flag for requests needing special treatment (e.g., privileged users, requests mid-stream).

Priority key (three levels):

Ki(t)=(1gi, 1ei(t), ri(t))(12)K_i(t) = \left(1 - g_i,\ 1 - e_i(t),\ r_i(t)\right) \tag{12}

Sorting in ascending order of Ki(t)K_i(t) produces:

Qt=LexSortiPt ⁣(1gi, 1ei(t), ri(t))(13)Q_t = \text{LexSort}_{i \in \mathcal{P}_t}\!\left(1-g_i,\ 1-e_i(t),\ r_i(t)\right) \tag{13}

Three priority levels explained:

  1. Level 1 (SLO safeguard, 1gi1-g_i): Protected requests (small 1gi1-g_i) always come first. These are requests that have been explicitly flagged as needing additional protection.
  2. Level 2 (urgency, 1ei(t)1-e_i(t)): Among unprotected requests, urgent requests (ei=1e_i = 1, small 1ei1-e_i) come before non-urgent ones. This prevents near-deadline requests from being blocked by shorter, lower-risk requests.
  3. Level 3 (remaining workload, ri(t)r_i(t)): Among requests at the same urgency level, shorter-remaining-workload requests come first. This maximizes the number of prefill completions per unit budget — a shortest-job-first heuristic within the urgency class.
Algorithm 2: Multi-Level Priority Sort
Input:  P_t = P_t^prefill ∪ P_t^wait (all prefill+wait requests)
        ρ_t (recent system throughput), t (current time)
        α (urgency threshold), ε (numerical stability constant)
Output: Q_t (sorted candidate queue)

For each request i in P_t:
    1. r_i(t) ← p_i - c_i(t)          // remaining prefill tokens
    2. s_i(t) ← a_i + L_i^ttft - t    // TTFT slack
    3. û_i(t) ← r_i(t) / (ρ_t · max(s_i(t), ε))   // urgency ratio
    4. e_i(t) ← 1 if û_i(t) > α else 0              // urgent flag
    5. K_i(t) ← (1-g_i, 1-e_i(t), r_i(t))          // priority key

Return LexSort_ascending(P_t, key=K_i(t))

4.4 BatchConstructor

BatchConstructor is activated when the Violation Checker determines that executing the maximally filled candidate batch would cause at least one request to violate its TTFT deadline. Rather than blindly dropping all at-risk requests or executing the full batch anyway, BatchConstructor formulates the problem as a constrained request selection problem.

Core DP formulation. For each “risky request” aa (a request whose TTFT slack would be violated by the full batch):

  1. Compute the token capacity available when anchoring on aa‘s deadline:
Ca=BaD(14)C_a = B_a - |\mathcal{D}| \tag{14}

where Ba=TimeToBudget(sa)B_a = \text{TimeToBudget}(s_a) is the token budget corresponding to aa‘s TTFT slack, and D|\mathcal{D}| accounts for the fixed decode request budget.

  1. Force-include request aa in the batch (it is the anchor). Define the remaining capacity after aa:
capacity=Cara(15)\text{capacity} = C_a - r_a \tag{15}
  1. For all other candidate requests jj with sjsas_j \geq s_a (requests with at least as much remaining slack as the anchor), define:
    • Weight: wj=rjw_j = r_j (remaining prefill tokens, the “cost” of including jj)
    • Value: composed of slack-urgency and workload terms:
vj=1sjkSask+rjkSark(16)v_j = \frac{1}{\dfrac{s_j}{\sum_{k \in S_a} s_k} + \dfrac{r_j}{\sum_{k \in S_a} r_k}} \tag{16}

This value function rewards requests with smaller relative slack (more urgent) and smaller relative remaining workload (easier to complete). The harmonic-mean-like structure balances the two criteria.

  1. Solve the 0/1 knapsack DP:
maxYSa{a}jYvjs.t.jYrjCara(17)\max_{\mathcal{Y} \subseteq S_a \setminus \{a\}} \sum_{j \in \mathcal{Y}} v_j \quad \text{s.t.} \quad \sum_{j \in \mathcal{Y}} r_j \leq C_a - r_a \tag{17}

The recurrence relation for the knapsack DP over requests sorted by index:

dp[j][c]=max ⁣{dp[j1][c](skip request j)dp[j1][crj]+vj(include request j, crj)(18)\text{dp}[j][c] = \max\!\begin{cases} \text{dp}[j-1][c] & \text{(skip request } j) \\ \text{dp}[j-1][c - r_j] + v_j & \text{(include request } j,\ c \geq r_j) \end{cases} \tag{18}
  1. The selection for anchor aa is Ya=Y{a}\mathcal{Y}_a = \mathcal{Y}^* \cup \{a\}.

Selecting the best anchor. BatchConstructor runs the above DP for each risky request and picks the best solution using a lexicographic comparator:

Y1Y2if(Y1, jY1vj, jY1rj)>lex(Y2, jY2vj, jY2rj)(19)\mathcal{Y}_1 \succ \mathcal{Y}_2 \quad \text{if} \quad \left(|\mathcal{Y}_1|,\ \sum_{j \in \mathcal{Y}_1} v_j,\ \sum_{j \in \mathcal{Y}_1} r_j\right) >_{\text{lex}} \left(|\mathcal{Y}_2|,\ \sum_{j \in \mathcal{Y}_2} v_j,\ \sum_{j \in \mathcal{Y}_2} r_j\right) \tag{19}

The comparator first maximizes the number of completed prefills, then total value, then total budget used.

Final batch allocation:

A={(i,1)iD}{(j,rj)jY}(20)A^* = \{(i, 1) \mid i \in \mathcal{D}\} \cup \{(j, r_j) \mid j \in \mathcal{Y}^*\} \tag{20}

Each decode request gets exactly 1 token; each selected prefill request gets its full remaining tokens (to complete prefill in this round and immediately generate the first token).

Algorithm 3: BatchConstructor
Input:  candidate_batch (sorted by MLPS), predictor,
        D (decode requests), T_full (predicted full batch latency)
Output: A* (optimal batch allocation), or None if no violations

1. Compute T̂_full = predictor.predict(candidate_batch)
2. If all TTFT slacks ≥ T̂_full: return None  // no violations, use SlidingChunker

3. risky ← {i ∈ prefill_requests : s_i(t) < T̂_full}
4. best ← None

5. For each anchor a ∈ risky:
    a. B_a ← TimeToBudget(s_a)
    b. C_a ← B_a - |D|
    c. If r_a > C_a: continue  // anchor alone exceeds capacity, skip

    d. S_a ← {j ∈ candidates : s_j ≥ s_a, j ≠ a}
    e. Compute v_j for each j ∈ S_a   // value function (Eq. 16)

    f. // 0/1 knapsack DP
       capacity ← C_a - r_a
       dp[0..n][0..capacity] ← 0
       For j = 1...|S_a|:
           For c = 0...capacity:
               dp[j][c] ← dp[j-1][c]
               If c ≥ r_{S_a[j]}:
                   dp[j][c] ← max(dp[j][c], dp[j-1][c-r_{S_a[j]}] + v_{S_a[j]})
       Y* ← backtrack(dp)  // selected requests
       Y_a ← Y* ∪ {a}

    g. If best == None or Y_a ≻ best: best ← Y_a

6. A* ← {(i,1) | i∈D} ∪ {(j, r_j) | j∈best}
7. Return A*

4.5 Integration and Control Flow

The four components integrate as follows: MLPS runs unconditionally every iteration (O(NlogN)O(N \log N)). The Violation Checker runs the predictor on the maximum candidate batch (1 predictor call). If no violation is detected, SlidingChunker runs (~20 predictor calls for ternary search). If a violation is detected, BatchConstructor runs (R×Sa|\mathcal{R}| \times |S_a| DP steps). In steady state under normal load (no violations), the per-iteration scheduling cost is dominated by MLPS and SlidingChunker’s ternary search. Under overload (frequent violations), BatchConstructor dominates.

The modular separation of the four components is a key engineering strength: MLPS, SlidingChunker, and BatchConstructor can each be disabled independently, enabling the clean ablation study in Section 5.4. The predictor is the only shared dependency — it feeds both the Violation Checker and the chunk optimizer, making its accuracy critical for both components simultaneously.

vLLM integration specifics. SlidingServe extends vLLM’s request metadata to carry per-request SLO fields (TTFT and TBT SLO values). The scheduler runs before each vLLM iteration to inject the token allocation (AA^*) into vLLM’s block manager. The system is designed to be a drop-in replacement for vLLM’s default scheduler, preserving compatibility with vLLM’s PagedAttention memory management and CUDA graph execution path.

5. Scheduling Algorithm Deep Dive

This section traces a complete scheduling round end-to-end, showing how each component interacts.

Setup: Suppose at time t=50t = 50 ms, the system has:

  • Decode queue D\mathcal{D}: 8 requests, most urgent decode slack sminD=30s_{\min}^D = 30 ms
  • Prefilling queue: 3 requests (r1: 500 tokens remaining, r2: 800 tokens, r3: 200 tokens)
  • Waiting queue: 2 requests (r4: 1024 tokens, r5: 300 tokens)
  • Recent throughput ρt=5000\rho_t = 5000 tokens/sec

In this trace, all four components interact: MLPS determines the order of candidates, SlidingChunker would determine the chunk size under normal load, BatchConstructor handles the violation, and the predictor underlies all latency estimates. The trace demonstrates that these components are not independently useful — BatchConstructor’s decision depends on the MLPS ordering (which requests are candidates), and SlidingChunker’s next-iteration estimate feeds into the predictor calibration.

Step-by-step walkthrough:

Step 1 — MLPS sorts candidates. Compute urgency ui(t)u_i(t) for r1–r5:

  • r3: remaining 200, slack 80 ms → u=200/(5000×0.08)=0.5u = 200/(5000 \times 0.08) = 0.5 → not urgent
  • r1: remaining 500, slack 40 ms → u=500/(5000×0.04)=2.5u = 500/(5000 \times 0.04) = 2.5 → urgent
  • r2: remaining 800, slack 120 ms → u=800/(5000×0.12)=1.33u = 800/(5000 \times 0.12) = 1.33 → urgent
  • r4: remaining 1024, slack 200 ms → u=1.02u = 1.02 → borderline
  • r5: remaining 300, slack 60 ms → u=1.0u = 1.0 → borderline

Sorted by Ki=(1gi,1ei,ri)K_i = (1-g_i, 1-e_i, r_i) ascending: r1 (urgent, 500), r2 (urgent, 800), r5, r4, r3.

Step 2 — Build candidate batch. Using last iteration’s chunk size, fill greedily: r1 (500 tokens), r2 (800 tokens), r3 (200 tokens), r5 (300 tokens) — until budget B_max = 2048 is reached. Batch total = 1800 tokens + 8 decode = 1808.

Step 3 — Violation Checker. Predictor estimates T^full=45\hat{T}_{\text{full}} = 45 ms. Request r1 has slack s1=40s_1 = 40 ms < 45 ms → violation! Violation Checker routes to BatchConstructor.

Step 4 — BatchConstructor. Anchor set: {r1} (only r1 has slack < 45 ms).

  • For anchor r1: Br1=TimeToBudget(40ms)=1600B_{r1} = \text{TimeToBudget}(40\text{ms}) = 1600 tokens, Cr1=16008=1592C_{r1} = 1600 - 8 = 1592.
  • Remaining capacity after r1: 1592500=10921592 - 500 = 1092.
  • Candidates Sr1S_{r1}: r2, r3, r5 (have slack ≥ 40 ms).
  • Run knapsack DP with capacity 1092, weights [800, 200, 300], compute values.
  • Optimal: include r3 (200) + r5 (300) + possibly r2 (800 > 1092-200-300=592? No, 800 > 592) → select r3 + r5.
  • Final batch: r1 (500), r3 (200), r5 (300) + 8 decode = 1008 tokens.
  • Predicted T^=32\hat{T} = 32 ms < 40 ms = r1’s slack ✓

Step 5 — GPU execution. Batch executes. r1, r3, r5 complete prefill and generate first tokens. r2, r4 deferred to next iteration.

Step 6 — Queue update. r1, r3, r5 move to Decoding Queue. r2 remains in Prefilling Queue. r4 remains in Waiting Queue.

Next iteration — SlidingChunker path. No violations predicted (r1, r3, r5 are now decoding; batch is lighter). SlidingChunker computes Tcur=sminD=30T_{\text{cur}} = s_{\min}^D = 30 ms, TnextT_{\text{next}}, finds B=1200B^* = 1200 tokens (ternary search), yielding a larger batch than the overly-conservative BgreedyB_{\text{greedy}} would allow.

flowchart TD
    S([Start of iteration]) --> MLPS[MLPS: sort all\nprefill+wait requests]
    MLPS --> CB[Build candidate batch\ngreedy fill to B_max]
    CB --> VC{Violation Checker\npredict T̂_full}
    VC -- "T̂_full ≤ min(s_i): NO violation" --> SC[SlidingChunker\nternary search for B*]
    VC -- "T̂_full > some s_i: VIOLATION" --> BCon[BatchConstructor\nDP per anchor]
    SC --> EB[Executable batch A*]
    BCon --> EB
    EB --> GPU[GPU forward pass]
    GPU --> UPD[Update queues\ncompute metrics]
    UPD --> S

Figure 4: Complete per-iteration scheduling flowchart. The Violation Checker gates which optimizer runs. Both optimizers produce the same output type (batch allocation A), ensuring a clean interface to the GPU execution layer.*

6. Theoretical Analysis

6.1 Why Sliding Window Outperforms Greedy: Formal Argument

Let f(b)f(b) denote the execution time of a batch with chunk size bb (assumed monotone non-decreasing, convex in bb due to increasing attention quadratic cost). The greedy approach maximizes b1b_1 subject to f(b1)Tcurf(b_1) \leq T_{\text{cur}}, leaving b2=BΣb1b_2 = B_\Sigma - b_1 for the next iteration.

The total tokens processed over two iterations is:

total(b1)=b1+(BΣb1)=BΣ(21)\text{total}(b_1) = b_1 + (B_\Sigma - b_1) = B_\Sigma \tag{21}

But “tokens processed” is not the same as “throughput” — what matters is processing under the respective deadline constraints TcurT_{\text{cur}} and TnextT_{\text{next}}. The SlidingChunker objective is to minimize total predicted execution time:

B=argminb[T^(b)+T^(BΣb)](22)B^* = \arg\min_b \left[\hat{T}(b) + \hat{T}(B_\Sigma - b)\right] \tag{22}

When T^\hat{T} is convex and symmetric, the minimum of T^(b)+T^(BΣb)\hat{T}(b) + \hat{T}(B_\Sigma - b) occurs at b=BΣ/2b = B_\Sigma / 2 (equal split), which is generically different from the greedy maximum bgreedy=TimeToBudget(Tcur)b_{\text{greedy}} = \text{TimeToBudget}(T_{\text{cur}}). For the paper’s example (deadline 100 ms / 50 ms), the greedy solution yields 260 tokens for the next iteration, while SlidingChunker yields a balanced split that processes 100 additional tokens total.

6.2 Time Complexity of BatchConstructor

Let nrn_r be the number of risky requests (anchors) and ncn_c be the number of candidate prefill requests. For each anchor aa:

  • Candidate set Sanc|S_a| \leq n_c
  • Capacity CaBmaxC_a \leq B_{\max} (bounded by max budget)
  • Knapsack DP runs in O(SaCa)O(|S_a| \cdot C_a)

Total complexity: O(nrncBmax)O(n_r \cdot n_c \cdot B_{\max}). In practice:

  • nrncn_r \ll n_c (violations are rare in steady state)
  • ncn_c is bounded by the batch size (typically 10–50 requests)
  • BmaxB_{\max} is bounded by 4096 or similar

So BatchConstructor runs in effectively O(nc2Bmax)O(n_c^2 \cdot B_{\max}) in the worst case. The paper does not analyze this complexity or discuss what happens under pathological violation cascades. For very large batches (e.g., 100+ concurrent requests), this could become significant.

TBatchConstructor=O ⁣(nrncCmax)(23)T_{\text{BatchConstructor}} = O\!\left(n_r \cdot n_c \cdot C_{\max}\right) \tag{23}

6.2b BatchConstructor as a Constrained Set Selection Problem

The BatchConstructor problem can be viewed through the lens of constrained combinatorial optimization. For each anchor aa, we solve a 0/1 knapsack over candidate requests with weights rjr_j (remaining tokens) and values vjv_j. This is NP-hard in general but tractable in practice because: (a) the number of candidate requests per anchor is small (Sa50|S_a| \leq 50 in typical serving), and (b) the capacity CaC_a is bounded by the max chunk size (4096\leq 4096 tokens). The DP table has at most 50×409620050 \times 4096 \approx 200K entries, computable in < 1 ms.

The greedy alternative (select by descending vj/rjv_j/r_j ratio, the fractional knapsack heuristic) yields at most a 2-approximation for 0/1 knapsack. SlidingServe uses exact DP because at the relevant scale, the problem is tractable — a sound engineering judgment that prioritizes correctness over asymptotic efficiency.

6.3 Throughput vs. SLO Attainment Tradeoff

The system throughput λ\lambda (requests served per second while meeting SLOs) is bounded by:

λ1E[Titer]E[requests completed per iter](24)\lambda \leq \frac{1}{\mathbb{E}[T_{\text{iter}}]} \cdot \mathbb{E}[\text{requests completed per iter}] \tag{24}

SlidingChunker increases E[batch size]\mathbb{E}[\text{batch size}] by 16.7% (ablation study, Section 5.4), directly increasing the numerator. BatchConstructor reduces E[Titer]\mathbb{E}[T_{\text{iter}}] for violation-prone batches, increasing the denominator’s efficiency. The interaction between the two is positive but not independently additive: both components improve goodput together (27.8% combined in ablation, 5.7% + 16.7% + 5.4% from the three components).

7. Experiments & Results

7.1 Setup

DimensionConfiguration
ModelsLlama3-8B, Qwen2.5-7B
HardwareNVIDIA RTX 3090 × 2, tensor parallel degree 2
Base frameworkvLLM
DatasetsShareGPT (dialogue), arXiv-Summarization-v1/v2 (long text), mixed-v1 (ShareGPT:arXiv-v1 = 3:1), mixed-v2 (5:1)
BaselinesSarathi-EDF (EDF scheduling on Sarathi-Serve), QoServe (ASPLOS 2026, self-declared SOTA)
MetricGoodput (req/s at ≤1% SLO violation), SLO violation rate (%), p50/p90/p99 TTFT and TBT

SLO settings: TTFT SLO expressed as Maximum TTFT Slowdown (ratio to exclusive-service latency). TBT SLO specified per dataset.

7.2 Key Results

Goodput evaluation (Section 5.1): SlidingServe achieves 25–111% higher goodput than Sarathi-EDF and 9.7–30% higher than QoServe across all datasets and models. The largest gains are on arXiv (long prompts), where SlidingChunker’s multi-iteration optimization is most impactful. On ShareGPT (short prompts), gains shrink (the sliding window degenerates when chunks span a single iteration).

SLO violations under overload (Section 5.2): At high load (2× goodput point), SlidingServe reduces the SLO violation rate by up to 53% compared to QoServe. The p50/p90 TTFT of SlidingServe is comparable to QoServe, but tail latency (p99) is occasionally higher — the paper explains this as an intentional tradeoff: requests that have already missed their SLO are deprioritized to protect others.

Transient overload (Section 5.3): Under a load pattern alternating between QPS=1.0 and QPS=2.5 every 2 minutes (20 min total, simulating a real production cycle):

  • SlidingServe: 30.22% fewer cumulative violations than Sarathi-EDF
  • SlidingServe: 23.74% fewer cumulative violations than QoServe SlidingServe adapts faster to load spikes because MLPS dynamically re-ranks urgency at every iteration.

Ablation study (Section 5.4), Qwen2.5-7B, mixed-v1:

ConfigurationGoodput vs. Sarathi-EDFSLO Violation Rate
Sarathi-EDF (baseline)+0%high
+SlidingChunker+16.7%moderate
+SlidingChunker +MLPS+22.4%lower
Full SlidingServe+27.8%lowest
BatchConstructor alone+5.4% (SLO violations)significant reduction

Figure 5: Ablation results. SlidingChunker contributes the most to throughput (+16.7%); MLPS and BatchConstructor add incremental improvements and are especially effective at reducing SLO violations under high load.

Predictor accuracy (Section 5.5):

GPU ConfigMAE (ms)RMSE (ms)
RTX 3090 (TP1)2.524.12> 0.99
RTX 3090 (TP2)2.724.33> 0.99

Figure 6: Batch Latency Predictor accuracy. The model generalizes across GPU configurations and achieves < 3 ms MAE, well below the 10–50 ms iteration times, making it a reliable input to the scheduling algorithms.

8. Comparison with Prior Art

graph LR
    subgraph Throughput-Only
        ORCA[Orca\nContinuous batching\nNo SLO]
        vLLM[vLLM\nPagedAttention\nNo SLO]
    end
    subgraph Chunked-Prefill
        SARATHI[Sarathi-Serve\nChunked prefill\nCoarse SLO]
        SARATHIEDF[Sarathi-EDF\nEDF priority\nSLO-aware]
    end
    subgraph SLO-Aware
        QOSERVE[QoServe\nHybrid priority\nProactive relegation]
        SLIDINGSERVE[SlidingServe\nSliding window\nDP batch construction]
        SOLA[SOLA\nState-aware scheduling\nMLSys 2025]
        POLYSERVE[PolyServe\nLoad-gradient routing]
    end
    ORCA --> vLLM
    vLLM --> SARATHI
    SARATHI --> SARATHIEDF
    SARATHIEDF --> QOSERVE
    QOSERVE --> SLIDINGSERVE

Figure 7: Evolution of LLM serving schedulers. SlidingServe sits at the frontier of SLO-aware scheduling, extending QoServe’s approach with sliding-window chunk optimization and DP-based intra-batch selection.

SystemChunked PrefillSLO-AwareMulti-Iter OptIntra-batch DPPredictor
Orca (2022)
vLLM (2023)
Sarathi-Serve (2023/24)Partial
Sarathi-EDF✓ (EDF)
QoServe (ASPLOS 2026)Partial
SlidingServe (2026)✓ (window-2)✓ (0/1 knapsack)✓ (linear)

Sarathi-EDF in detail. Sarathi-Serve (Agrawal et al., 2023/2024) introduced chunked prefill to LLM serving. The EDF extension adds deadline-based priority ordering but uses a fixed chunk size per request configuration and no cross-iteration optimization. Its SLO-awareness is purely reactive: it prioritizes urgent requests but cannot look ahead to avoid future violations. Under heavy load, Sarathi-EDF fails because EDF alone cannot compensate for the global chunk-size miscalibration.

QoServe (ASPLOS 2026) in detail. QoServe improves over Sarathi-EDF with: (a) fine-grained QoS tier classification (grouping requests by SLO class), (b) hybrid prioritization combining deadline urgency with estimated processing time, and (c) proactive relegation — downgrading requests that have already violated to make room for others. Compared to SlidingServe, QoServe lacks: (a) a latency predictor (it estimates execution time from coarser statistics), (b) explicit cross-iteration joint optimization (its dynamic chunking is single-step), and (c) intra-batch DP selection. The 9.7–30% goodput gap suggests these three additions are complementary and each carries measurable impact.

Key differentiators of SlidingServe vs. QoServe (the closest baseline):

  • QoServe uses “hybrid prioritization and proactive relegation” — heuristic, no joint optimization.
  • SlidingServe performs explicit two-iteration joint optimization with a predictor.
  • SlidingServe explicitly addresses intra-batch TTFT blindness; QoServe does not.
  • SlidingServe achieves 9.7–30% higher goodput than QoServe on the tested configurations.

9. Critical Assessment: Weaknesses & Improvements

9.a Weaknesses and Flaws

1. Evaluation scale is very narrow. The entire evaluation uses a single hardware setup: 2× RTX 3090 with TP2. The RTX 3090 is a consumer/prosumer GPU, not representative of production A100/H100 clusters. The paper provides no evidence that the 7-dimensional predictor generalizes to different GPU architectures (e.g., H100 with HBM3, different memory bandwidth profiles). The predictor’s linear features x1=ci(ui+ci)x_1 = \sum c_i(u_i + c_i) capture attention quadratic cost correctly for standard attention, but would need re-fitting for hardware with FlashAttention-2/3 implementations that have different effective bandwidth characteristics.

2. Only two models tested. Llama3-8B and Qwen2.5-7B are both ~7B parameter models. The paper makes no claims about scalability to 70B or 405B parameter models, where the memory bandwidth pressure (and thus the predictor’s regime) changes qualitatively. In production, a 70B model in BF16 with TP8 would have very different chunk-size/latency relationships.

3. BatchConstructor worst-case complexity is not bounded in paper. The DP in BatchConstructor has complexity O(nrncCmax)O(n_r \cdot n_c \cdot C_{\max}). Under sustained overload, nrn_r (number of risky requests) could be large, making BatchConstructor a scheduling bottleneck. The paper reports that the predictor calibration “does not block the main inference path,” but does not discuss the overhead of BatchConstructor itself. No latency breakdown of the scheduler’s CPU time is provided.

4. SLO types are limited. The system targets TTFT and TBT SLOs. It does not address end-to-end latency SLOs (total generation time), SLO differentiation across multiple tenants or user tiers, or output length SLOs. Production environments often have more complex SLO structures (e.g., P99 TTFT ≤ X AND median TBT ≤ Y AND max output length bounded).

5. Baseline selection is questionable. The paper compares against Sarathi-EDF (a custom implementation) and QoServe (ASPLOS 2026). It does not compare against SOLA (MLSys 2025), PolyServe (2025), WindServe (ISCA 2025), or APT-Serve (2025), which are all recent SLO-aware schedulers cited in the related work. The claim “up to 30% vs. advanced schedulers” is based on comparison with only one truly competitive system.

6. The sliding window size of 2 is hardcoded. The paper studies only a window of size 2 (current + next iteration). There is no analysis of whether a larger window (3–4 iterations) would yield further gains, or whether the two-step approximation is near-optimal in practice. The choice of window size 2 is motivated by intuition rather than ablation.

9.b Limitations the Authors Understate

Degraded performance on short-prompt workloads. The paper acknowledges that “on ShareGPT (short prompts), gains shrink” but does not quantify the overhead cost. If SlidingServe introduces meaningful CPU scheduling overhead (predictor calls + ternary search + DP per iteration) that is not offset by throughput gains on short workloads, it could be worse than Sarathi-EDF for chatbot applications with short prompts.

Online predictor calibration concurrency. The predictor is updated “asynchronously in a background thread and replaces the model via hot-swapping.” Hot-swapping a linear model is fast, but the paper does not discuss the data collection and re-training overhead, nor the staleness of the model during rapid load changes. If the load changes faster than the re-training interval, the predictor may be systematically mis-calibrated.

Interaction with speculative decoding and quantization. Modern serving stacks increasingly use speculative decoding (which fundamentally changes the per-token execution model) and INT4/INT8 quantization (which changes the compute/bandwidth balance). SlidingServe’s predictor features may not capture these effects, and the scheduling algorithms would need to be re-designed.

9.c Concrete Improvement Suggestions

  1. Extend the predictor with hardware-agnostic features: Replace x1,x2x_1, x_2 (raw token counts) with actual FLOPs estimates normalized by GPU peak FLOPS and memory bandwidth. This would improve cross-hardware generalization without requiring full re-training.

  2. Add a window-size ablation. Test window sizes 2, 3, 4 to determine the marginal value of look-ahead. This would either validate the two-step approximation or reveal gains from longer horizons.

  3. Bound and profile BatchConstructor overhead. Add a CPU-side scheduler latency breakdown in the evaluation, similar to how vLLM reports scheduling overhead. If DP overhead is > 1 ms per iteration, it becomes a meaningful bottleneck at 10–50 ms iteration times.

  4. Evaluate on A100/H100 hardware. The paper would be significantly more impactful if it included results on datacenter-grade GPUs, which dominate production LLM deployments.

  5. Handle output-length uncertainty. The current system does not model output length distributions. A simple extension: maintain a per-request output-length histogram from historical data and use it to predict future decode queue pressure, improving the next-window estimate in SlidingChunker.

  6. Add comparison with SOLA and WindServe. Including the full slate of 2025 SLO-aware schedulers would make the empirical claims stronger and allow readers to understand where SlidingServe sits in the broader landscape.

9.d Implementation Notes and Hidden Costs

Predictor hot-swap concurrency. The paper describes the predictor as being updated via “hot-swapping” from a background thread. In a multi-threaded scheduler, this requires either a read-writer lock (introducing lock contention on every prediction call) or a copy-on-write model pointer swap. Neither approach is analyzed in the paper. At high request rates (e.g., 100 req/s × 50 iterations/s = 5000 scheduling calls/second), even microsecond-scale lock contention could be measurable.

The TimeToBudget inverse function. Both SlidingChunker and BatchConstructor rely on TimeToBudget(T), which inverts the predictor: given a latency target TT, find the largest chunk bb such that T^(b)T\hat{T}(b) \leq T. The paper implements this as binary search over bb. For a linear predictor, this could be computed analytically (b=(TT^offset)/wslopeb^* = (T - \hat{T}_{\text{offset}}) / w_{\text{slope}}), but the non-linearity introduced by the mixed-batch scenario selection makes binary search necessary. The overhead is O(logBmax)O(\log B_{\max}) predictor evaluations, typically 12–15 calls.

Memory overhead of three queues. Maintaining three separate queues with per-request metadata (arrival time, SLO parameters, cumulative token count, protection flag, urgency estimate) adds O(N)O(N) memory overhead where NN is the total number of live requests. For a system with 1000 concurrent requests, this is trivially small (< 1 MB), but the priority re-sorting at every iteration is O(NlogN)O(N \log N) in the naive case. The paper does not discuss whether MLPS is implemented with an incremental data structure (e.g., a tournament tree that updates in O(logN)O(\log N) per request state change) or a full re-sort each iteration.

10. Limitations & Boundary Conditions

The SLO violation rate vs. throughput Pareto frontier. A fundamental insight from the experiments: SlidingServe does not merely trade off throughput for SLO compliance — it expands the Pareto frontier. At the same SLO violation rate (e.g., ≤1%), SlidingServe achieves higher throughput than QoServe. At the same throughput, it achieves a lower violation rate. This is only possible because the components reduce scheduling waste: SlidingChunker reduces the number of iterations wasted on inefficient chunk sizes; MLPS reduces the budget allocated to requests that have no chance of meeting their SLO; BatchConstructor prevents atomic batch-level SLO violations before they happen.

The fundamental reason greedy scheduling cannot be Pareto-optimal: it maximizes a local objective (current iteration throughput) that is not aligned with the global objective (goodput over many iterations). SlidingServe’s two-step joint optimization is the minimal extension needed to break this local optimality trap without introducing exponential complexity.

Where SlidingServe works best:

  • Long-prompt workloads (arXiv-Summarization), where chunked prefill spans many iterations and the two-step optimization pays off
  • Heterogeneous mixed workloads with diverse TTFT/TBT SLOs
  • Moderate-to-heavy load (QPS near the goodput point), where violation risk is high enough to make BatchConstructor valuable

Interaction with KV cache eviction under memory pressure. When GPU memory is full and new requests cannot be admitted due to KV cache exhaustion, systems like vLLM use KV cache eviction (swapping running requests to CPU memory or recomputing). SlidingServe does not model memory pressure in its scheduling loop — its SlidingChunker and BatchConstructor operate on the assumption that all queued requests have their KV cache resident. Under extreme memory pressure, the system may make scheduling decisions that conflict with memory management, potentially causing unexpected preemption mid-batch.

Where SlidingServe degrades or may not help:

  • Pure short-prompt workloads (e.g., chatbots with ≤100 token prompts): chunked prefill completes in one iteration, so SlidingChunker degenerates to single-step scheduling and provides no benefit. In this regime, SlidingServe’s overhead (predictor calls, ternary search) is pure cost with no throughput benefit. The paper does not provide latency breakdowns to quantify this overhead.
  • Pure decode-heavy workloads (long generation, short prompts): the Waiting Queue is nearly empty, so MLPS and BatchConstructor have little to optimize.
  • Very tight SLOs (TTFT ≤ 500 ms with heavy load): at extreme load, even BatchConstructor cannot prevent violations because nrn_r approaches ncn_c and the DP produces trivially small batches.
  • Highly variable output lengths: the TBT deadline model (Eq. 1) assumes the request will eventually complete; if output length is highly variable, the per-token deadline structure may misallocate resources late in a long generation.
  • Multi-node distributed setups (TP > 2, PP > 1): The system is built on TP2 only. Pipeline parallelism introduces bubble overhead that the predictor’s features do not model.

7.3 Goodput Across Datasets (Annotated)

The goodput improvement varies significantly across dataset types, which provides diagnostic insight:

  • ShareGPT (short prompts, dialogue): +25–35% vs. Sarathi-EDF; +9.7–12% vs. QoServe. The MLPS contribution is minimal because short prompts rarely linger in the prefill queue long enough for urgency to differentiate. SlidingChunker’s benefit is also limited since most prefill completes in a single chunk.

  • arXiv-Summarization (long prompts, single-document summaries): +80–111% vs. Sarathi-EDF; +22–30% vs. QoServe. The large gains reflect the scenario SlidingServe is most optimized for: long prefill that must be chunked across many iterations, where the two-iteration joint optimization materially reduces the total number of iterations needed.

  • Mixed-v1/v2 (composite): +40–60% vs. Sarathi-EDF; +15–20% vs. QoServe. The mixed workload exercises all three components simultaneously: MLPS differentiates urgency across heterogeneous prompts, SlidingChunker optimizes the overall budget, and BatchConstructor handles the spike cases where a slow arXiv request blocks a time-sensitive ShareGPT request.

The asymmetry between the +9.7% minimum gain (short prompts) and +30% maximum gain (long prompts) suggests that SlidingServe’s value proposition is strongest in workloads dominated by long-context requests — which includes document Q&A, code review, and summarization tasks that are increasingly common in production.

7.4 Transient Load Dynamics: A Scheduling-Theoretic Perspective

The transient overload experiment (alternating QPS=1.0 and QPS=2.5 every 2 minutes) reveals an important system property: reaction speed to load changes. When load spikes from 1.0 to 2.5 QPS:

  1. The Waiting Queue rapidly fills with new requests.
  2. MLPS re-evaluates urgency at every iteration, immediately reordering based on current slack.
  3. SlidingChunker’s window constraint TnextT_{\text{next}} automatically tightens as the decode queue grows (more decode requests → smaller sminDs_{\min}^D), limiting the chunk size and controlling TTFT growth.

This closed-loop response is faster than QoServe’s proactive relegation, which requires maintaining a separate “relegated request” state that may lag the actual load dynamics. SlidingServe’s reaction time is bounded by a single iteration (\approx20–50 ms), making it inherently adaptive.

11. Conclusion

SlidingServe is a well-engineered system that addresses a real and important problem in LLM serving. Its core contribution — treating chunk-size selection as a two-iteration joint optimization rather than a single-step greedy decision — is both novel and theoretically well-motivated. The supporting infrastructure (lightweight predictor, lexicographic priority sorter, DP batch constructor) is carefully designed and the ablation study gives genuine insight into each component’s contribution.

Connection to real-time scheduling theory. The problem SlidingServe solves has deep connections to real-time systems theory. The TBT deadline model (Eq. 1) is essentially a periodic real-time task model where each request generates a task with period LitbtL_i^{\text{tbt}} and deadline equal to the period. In classical real-time scheduling, Earliest Deadline First (EDF) is optimal for periodic tasks on a single processor. The key difference in LLM serving is that: (a) tasks are not periodic in arrival (requests arrive randomly), (b) the “processor” (GPU batch) has variable capacity depending on batch composition, and (c) task execution time is not fixed but depends on the batch (the mutual interference problem). These deviations from the classical RT model are precisely what makes the problem hard and what SlidingServe’s custom algorithms address. Future work could explore whether provable scheduling guarantees from RT theory can be ported to this setting under simplifying assumptions.

Open problems for future work. Several interesting directions emerge from this paper:

  1. Window size beyond 2. The ternary-search framework naturally extends to window size k>2k > 2: jointly optimize b1,,bkb_1, \ldots, b_k subject to bi=BΣ\sum b_i = B_\Sigma and f(bi)Tif(b_i) \leq T_i for each ii. This becomes a kk-dimensional constrained optimization, tractable with coordinate descent or gradient methods if k4k \leq 4.

  2. Learned latency model. The 7-feature linear predictor works well for fixed hardware, but a small neural network (2–3 layers, < 10K parameters) could capture interactions between features and generalize across GPU families. With modern ONNX runtime, inference latency would still be < 10 μs.

  3. Preemption-aware scheduling. SlidingServe does not support request preemption (temporarily swapping out a running request’s KV cache to disk/CPU). Integrating preemption with the DP solver could further reduce violation rates: when no feasible batch exists that satisfies all deadlines, preempt the lowest-value request rather than violating.

  4. Multi-model serving. Production clusters increasingly serve multiple models on shared GPU resources. The MLPS urgency computation and SlidingChunker’s window constraints would need to account for cross-model resource contention and model-switching overhead.

What I found most interesting is the BatchConstructor formulation. The idea that request selection within a batch should be treated as a knapsack problem, with value functions that explicitly trade off urgency and remaining workload, is elegant and generalizable. This formulation could be extended to more complex SLO structures or multi-tenant scenarios.

Reproducibility. The paper uses publicly available datasets (ShareGPT, arXiv-Summarization from HuggingFace) and standard hardware (RTX 3090). However, the implementation is not open-sourced as of the submission date. The key components — the predictor training code, the SlidingChunker ternary search, and the BatchConstructor DP — are described at an algorithmic level sufficient for reproduction, but without an open codebase, independent verification is not possible. This is a common limitation of systems papers, but it is worth noting given the competitive nature of the serving-systems field.

Ethical and societal considerations. SlidingServe enables more efficient use of GPU resources for LLM serving, which has a dual-use dimension: it lowers the marginal cost of serving LLMs, potentially democratizing access, but also enabling higher-throughput deployment of models for surveillance, misinformation generation, or other harmful applications. The paper does not address this, which is standard for systems papers but increasingly noted by reviewers.

From a research-methodology perspective, SlidingServe exemplifies the “formalize-then-solve” approach to systems design: identify the informal observation (greedy is locally optimal but globally suboptimal), formalize it as a mathematical objective (Eq. 22), and provide an efficient algorithm (ternary search) with a well-understood complexity profile. The BatchConstructor follows the same pattern. This methodology is more robust than ad-hoc heuristics and produces systems whose behavior can be reasoned about analytically.

The most significant open question is scale: does this approach generalize to A100/H100 clusters, larger models (70B+), and more complex deployment topologies? The current evaluation is too narrow to answer this definitively. Future work should also explore whether a window size greater than 2 provides meaningful gains and whether the predictor can be made architecture-agnostic through better feature design. Despite these limitations, SlidingServe is a meaningful step forward in principled SLO-aware LLM serving.

Summary of contributions revisited. In terms of novelty:

  • The Batch Latency Predictor is not novel in concept (linear regression on hardware counters is standard in systems), but the specific 7-feature design for LLM batches is well-motivated and practically effective.
  • The Multi-Level Priority Sorter is a clean formalization of existing ad-hoc ideas (urgency + fairness + workload) into a lexicographic key, making it principled and tunable.
  • The SlidingChunker is the most novel contribution: framing chunk-size selection as a two-iteration joint optimization and solving it with ternary search is a concrete algorithmic advance over the state of the art.
  • The BatchConstructor is also novel: treating intra-batch request selection as a 0/1 knapsack is a new formulation that addresses a problem not previously identified in the LLM serving literature.

Together, these four components represent a coherent and principled system design that advances the state of the art in SLO-aware LLM scheduling. The paper is recommended for publication with minor revisions (primarily: broader hardware evaluation, computational overhead analysis, and full comparison with 2025 SLO-aware baselines).

Broader impact on the LLM infrastructure ecosystem. SlidingServe’s predictor-driven approach creates a new design pattern for LLM schedulers: collect hardware profiling data, train a lightweight model, and use that model to make scheduling decisions that would otherwise require expensive simulation or heuristic approximation. This pattern is generalizable: the same approach could be used for admission control (predict whether a new request can be admitted without violating existing SLOs), KV cache prefetching (predict which KV pages will be needed in the next 2 iterations), or preemption decisions (predict which request’s eviction will minimally impact goodput).

Personal assessment. SlidingServe is a technically solid paper with a clear problem statement, well-motivated algorithms, and convincing empirical results. The decision to use exact DP rather than approximations, and ternary search rather than gradient methods, reflects a mature engineering judgment about the problem scale. The main weakness is the narrow evaluation scope, which limits the impact of the claims. For practitioners deploying LLM serving at scale, SlidingServe’s ideas — especially the two-iteration joint optimization and the intra-batch DP — are directly applicable and worth implementing even if full SlidingServe adoption requires re-validation on different hardware.


Reviewed by Zhongzhu Zhou on 2026-06-07. Paper: arXiv:2606.05933v1. Hardware tested by authors: 2× NVIDIA RTX 3090 (TP2). Models: Llama3-8B, Qwen2.5-7B. Baselines: Sarathi-EDF, QoServe (ASPLOS 2026). Datasets: ShareGPT, arXiv-Summarization-v1/v2, mixed-v1, mixed-v2.