Review date: 2026-06-14
Review author: Zhongzhu Zhou
Paper reviewed: GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving
Paper authors: Xinwei Qiang, Yifan Hu, Shixuan Sun, Jing Yang, Han Zhao, Chen Chen, Yu Feng, Jingwen Leng, Minyi Guo
arXiv: 2606.13501v1, 2026-06-12
Venue/status: arXiv preprint (Shanghai Jiao Tong University)
Short Answer
GF-DiT proposes a principled answer to one of the most pressing open problems in generative AI serving: how to efficiently serve workloads with highly heterogeneous computational demand across requests, stages, and time when GPU resources are expensive and fixed hardware provisioning is costly.
Existing systems such as vLLM-Omni and SGLang Diffusion assign each request a static parallel configuration for its entire lifetime. This is fine if all requests are similar and conditions are stable — but they never are in production DiT serving. A text-to-video request generating a 4-second clip at 1080p performs orders of magnitude more compute than a thumbnail generation request. The denoising stage of that same video request needs 4–8× more sequence parallelism than its text encoding stage. And the optimal degree of parallelism for any one request flips depending on whether the queue behind it holds one request or fifty.
GF-DiT’s central thesis is an OS-style insight: GPU parallelism should be managed like CPU time slices. Just as an OS preempts processes and redistributes CPU time dynamically, a DiT serving system should continuously reallocate GPU parallelism among competing requests according to their evolving needs and the system’s service objectives. This requires solving two hard engineering problems: (1) finding semantically safe points at which to change a request’s GPU allocation mid-execution, and (2) making the act of changing group membership so cheap that it can happen dozens of times per request.
GF-DiT solves problem (1) with a trajectory task abstraction that decomposes each DiT request into independently schedulable sub-tasks (encoding, each denoising step, decoding), each producing a well-defined transferable state. It solves problem (2) with group-free collectives, a novel communication primitive that eliminates NCCL communicator construction entirely, reducing group formation overhead from 778 ms to approximately 60 μs. Together these two pieces make elastic parallelism practical, and a clean policy API on top makes the system programmable for diverse service objectives.
1. Prerequisites
GF-DiT requires background in four areas: Diffusion Transformer architecture, GPU collective communication, sequence parallelism, and serving system design. I cover each carefully before tackling the paper’s contributions.
1.1 Diffusion Models and the Denoising Process
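Diffusion models (Ho et al., 2020; Song et al., 2021) generate images or videos by learning to reverse a Markov chain that progressively adds Gaussian noise to data. Training teaches the model to estimate the added noise (or the clean data directly) at each noise level. At inference, generation starts from pure Gaussian noise $x_T$ and iteratively denoises it over $T$ steps:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t, c)\right) + \sigma_t z$$
where $\alpha_t$ is the noise schedule, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, $\epsilon_\theta$ is the learned denoiser (conditioned on text/image prompt $c$), $\sigma_t$ is the step-dependent noise scale, and $z \sim \mathcal{N}(0, I)$ is fresh noise.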
Inference cost: the denoiser must execute $T$ times per sample, where $T$ is typically a few tens of steps for DDIM/flow-matching samplers and up to $T = 1000$ for pure DDPM. For video generation, the latent sequence is long (often thousands of tokens after VAE encoding), making each forward pass expensive. The total FLOPs for a video generation request is roughly:
$$\text{FLOPs} \approx T \cdot O\!\left(L^2 d + L d^2\right)$$
where $L$ is the latent sequence length (proportional to resolution × duration) and $d$ is model width, with the layer count folded into the constant. A 4-second 1080p video has both a far longer latent sequence and more denoising steps than a 256×256 thumbnail, which works out to roughly a 400× difference in total compute.
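To make the scaling in the estimate above concrete, here is a minimal Python sketch. The constants (4 for attention, 24 for projections plus a 4×-wide MLP) are the usual back-of-envelope Transformer figures, and the function name, request shapes, width, and step counts are illustrative assumptions, not values from the paper.

```python
def denoising_flops(seq_len: int, width: int, steps: int, layers: int = 1) -> float:
    """Rough FLOP count for a whole denoising trajectory.

    Per layer and per step (back-of-envelope, assumed constants):
      ~4 * L^2 * d   for the attention scores and the weighted sum of values
      ~24 * L * d^2  for QKV/output projections plus a 4x-wide MLP
    """
    attention = 4 * seq_len**2 * width
    projections_and_mlp = 24 * seq_len * width**2
    return float(steps * layers * (attention + projections_and_mlp))


# Hypothetical request shapes, chosen only to show the spread.
video = denoising_flops(seq_len=32_768, width=3072, steps=50)
thumbnail = denoising_flops(seq_len=256, width=3072, steps=20)
print(f"video / thumbnail compute ratio: {video / thumbnail:.0f}x")
```

The quadratic attention term only dominates once $L$ is several times larger than $d$, which is why the gap between long-video and thumbnail requests grows faster than the token count alone would suggest.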
1.2 Diffusion Transformers (DiTs)
The Diffusion Transformer (Peebles & Xie, 2023) replaces the U-Net backbone used in earlier diffusion models with a Transformer backbone. The key insight is that latent image or video patches, once tokenized, can be processed by a standard Transformer (optionally with AdaLN-Zero conditioning blocks for the timestep and class embedding).
Modern production DiTs (Wan, HunyuanVideo, CogVideoX, Open-Sora 2.0, FLUX, etc.) share roughly the following architecture:
Input latent (image/video)
→ Patchify (e.g., 2×2 spatial patches)
→ Linear projection to d_model
→ Add positional embeddings
→ N × DiT blocks:
|- AdaLN-Zero: scale/shift from (t, c) embedding
|- Multi-head self-attention over ALL patches
|- AdaLN-Zero
|- Feed-forward network
→ Unpatchify
→ Output noise prediction ε or clean prediction x₀
The key difference from LLM Transformers is that DiT uses bidirectional (non-causal) self-attention — every latent token attends to every other token at each denoising step. This means:
- No KV-cache accumulation (unlike autoregressive LLMs)
- Full quadratic attention cost per step: $O(L^2 d)$ for sequence length $L$ and model width $d$
- Long sequences (4K–32K tokens for video) make parallelism critical
1.3 A Three-Stage Inference Pipeline
DiT inference follows a fixed three-stage structure for each request:
Stage 1 (Text/Image Encoding): a CLIP, T5, or similar encoder converts the text/image prompt into conditioning embeddings $c$. This is computationally lightweight (typically 10–30× cheaper than a single denoising step) and involves short sequences. It benefits little from high degrees of GPU parallelism.
Stage 2 (Denoising Trajectory): the DiT model executes $T$ sequential steps, each applying the full Transformer to the latent sequence. This stage dominates compute: for video generation, it accounts for >95% of total inference time. It benefits strongly from sequence parallelism (SP) because long latent sequences can be distributed across GPUs.
Stage 3 (VAE Decoding): a Variational Autoencoder decoder maps the final denoised latent back to pixel space. This is moderately expensive but does not scale as aggressively with GPU count as the denoising stage (it is a convolutional or Transformer decoder over fixed-size spatial data).
The static parallelism mismatch: existing systems assign a single SP degree (e.g., SP=4 using 4 GPUs) to all three stages. Stage 1 is severely over-provisioned (wastes 3 GPUs). Stage 2 might be correctly provisioned for large videos but under-provisioned for small images. Stage 3 has yet a third preference. The single-SP-for-all approach is a compromise that excels nowhere.
1.4 Sequence Parallelism for Transformers
For a Transformer with sequence length $L$ processed on $P$ GPUs using Sequence Parallelism (SP):
- Each GPU holds $L/P$ tokens locally
- Self-attention requires exchanging key-value pairs across GPUs (using All-Gather or Ring-Attention)
- The FFN layer operates locally after the attention output is distributed back
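The communication vs. computation tradeoff:
$$\eta(L, P) = \frac{T_{\text{comp}}(L)}{P \cdot \left(T_{\text{comp}}(L)/P + T_{\text{comm}}(L, P)\right)}$$
where $T_{\text{comp}}(L)$ is the single-GPU compute time and $T_{\text{comm}}(L, P)$ is the communication time added by SP. For short sequences, the communication overhead (All-Gather of K and V matrices) dominates and efficiency drops below $1/P$: you are adding GPUs but slowing down. For long sequences (video), the computation dominates and efficiency approaches $1$, making SP beneficial.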
This creates a key design implication: the optimal SP degree depends on sequence length (request shape). A system that can adapt SP dynamically — rather than fixing it at admission time — can exploit this relationship across diverse workloads.
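The dependence of the best SP degree on sequence length can be illustrated with a toy cost model. Everything numeric below (GPU throughput, link bandwidth, per-hop latency, model width) is a made-up assumption chosen to make the crossover visible; it is not a measurement from the paper, and `layer_time_s`, `efficiency`, and `best_sp` are hypothetical helper names.

```python
def layer_time_s(seq_len, sp, width=3072, flops_per_s=1e14,
                 link_bytes_per_s=3e11, hop_latency_s=1e-4):
    """Toy time for one Transformer layer on `sp` GPUs (all constants assumed)."""
    flops = 4 * seq_len**2 * width + 24 * seq_len * width**2
    compute = flops / sp / flops_per_s        # idealized: compute splits evenly
    if sp == 1:
        return compute                        # no communication on a single GPU
    kv_bytes = 2 * seq_len * width * 2        # K and V, 2 bytes per element
    # Each GPU receives the (sp-1)/sp remote share of K/V over sp-1 hops.
    comm = hop_latency_s * (sp - 1) + kv_bytes * (sp - 1) / sp / link_bytes_per_s
    return compute + comm


def efficiency(seq_len, sp):
    """Parallel efficiency: single-GPU time over (sp * parallel time)."""
    return layer_time_s(seq_len, 1) / (sp * layer_time_s(seq_len, sp))


def best_sp(seq_len, candidates=(1, 2, 4, 8)):
    """SP degree with the lowest modeled layer time."""
    return min(candidates, key=lambda p: layer_time_s(seq_len, p))


for length in (64, 1024, 16_384):
    print(length, best_sp(length), round(efficiency(length, 8), 3))
```

Under these assumed constants, a 64-token sequence is fastest on one GPU while a 16K-token sequence prefers eight, which is the shape dependence a static SP setting cannot follow.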
1.5 GPU Collective Communication and NCCL
GPU collectives (AllReduce, AllGather, ReduceScatter, etc.) are the backbone of distributed deep learning. NCCL (NVIDIA Collective Communications Library) is the standard implementation. Key concepts:
Communicator: an NCCL communicator (ncclComm_t) is an opaque handle representing a group of GPUs that participate in collective operations together. Before any collective can execute, a communicator must be constructed via ncclCommInitRank (or ncclCommInitAll). This involves:
- Each GPU allocating shared memory for communication buffers
- Exchanging capabilities across the group (topology discovery)
- Initializing transport-layer connections (NVLink rings, PCIe trees)
- Building routing tables
This initialization is expensive — the paper measures 778 ms for forming a new NCCL communicator group in their evaluation. While this is a one-time cost in static systems (the communicator is created once at startup), it becomes a blocking overhead in dynamic systems that need to form new groups on-the-fly for elastic parallelism.
Why existing NCCL groups are incompatible with elastic parallelism: suppose a request currently uses SP=4 (GPUs 0,1,2,3) and we want to change it to SP=2 (GPUs 0,1). With standard NCCL, we would need to create a new ncclComm_t for {GPU 0, GPU 1}, which takes ~778 ms. Since a denoising step on 2 GPUs might take only 50–200 ms, the group formation time dwarfs the computation. Elastic parallelism becomes impractical.
1.6 Head-of-Line Blocking in Serving
A classic problem in network packet scheduling, Head-of-Line (HoL) blocking occurs when a large item at the head of a queue prevents smaller items behind it from being serviced. In DiT serving, a long video generation request (e.g., 2 minutes of computation) assigned SP=4 on GPUs 0–3 blocks all 4 GPUs from serving shorter image requests that arrive later, even though those short requests could complete in <1 second.
Static parallelism exacerbates HoL because there is no mechanism to preempt or de-prioritize the long request once started. The only solution in static systems is aggressive batching (interleaving requests), but this requires homogeneous request shapes to form efficient batches — which is exactly the heterogeneity challenge DiT workloads present.
2. System Overview: GF-DiT Architecture
GF-DiT is a runtime system that sits between the serving front-end (which receives requests and maintains the queue) and the GPU execution layer (which runs the actual DiT model). Its architecture has three main components:
graph TB
FE["Serving Front-End\n(Request Queue)"] --> TG["Trajectory Task Graph\nGenerator"]
TG --> RT["GF-DiT Runtime\n(Asynchronous Executor)"]
RT --> PI["Policy Interface\n(Pluggable Scheduler)"]
PI --> GFC["Group-Free Collectives\n(Communication Layer)"]
GFC --> GPU["GPU Workers\n(DiT Model Execution)"]
GPU -- "task completion + state" --> RT
RT -- "scheduling decision" --> PI
Figure 1: GF-DiT system architecture. Requests are decomposed into trajectory task graphs; the asynchronous runtime exposes scheduling decisions to a pluggable policy; group-free collectives handle GPU group reconfiguration at ~60 μs overhead.
The key design decision is to separate scheduling policy (what parallelism should each task use?) from runtime mechanism (how do we actually execute a task with a given parallelism?). This separation is what makes GF-DiT programmable: the same runtime substrate supports radically different scheduling objectives.
3. Core Abstraction: Reschedulable Trajectory Tasks
3.1 The Trajectory Task Graph
GF-DiT represents each DiT request as a trajectory task graph — a directed acyclic graph (DAG) where nodes are trajectory tasks and edges are artifact dependencies.
Definition (Trajectory Task): A trajectory task $\tau$ represents one semantically complete unit of DiT computation: either a model stage (encoding, VAE decoding, latent preparation) or a single denoising step $t$. Completing task $\tau$ produces a well-defined, transferable model state.
Definition (Artifact): An artifact $a$ is a named tensor that carries information between tasks. For example:
- $a_{\text{embed}}$: conditioning embeddings produced by the encoder, consumed by all denoising steps
- $a_{x_{t-1}}$: latent state produced by denoising step $t$, consumed by step $t-1$
- $a_{x_0}$: final denoised latent $x_0$, consumed by the VAE decoder
The trajectory task graph for a request with $T = 4$ denoising steps looks like:
graph LR
ENC["Encode\n(text→embed)"] --> D4["Denoise\nstep T=4"]
D4 --> D3["Denoise\nstep T=3"]
D3 --> D2["Denoise\nstep T=2"]
D2 --> D1["Denoise\nstep T=1"]
D1 --> DEC["VAE Decode\n(latent→pixels)"]
ENC -- "a_embed" --> D3
ENC -- "a_embed" --> D2
ENC -- "a_embed" --> D1
Figure 2: Trajectory task graph for a 4-step denoising request. Each node is an independently schedulable task. Artifact dependencies (edges) define execution order. The embeddings are shared across all denoising steps.
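The graph in Figure 2 can be sketched as plain data. This is a minimal illustration rather than GF-DiT's actual data structure: the `Task` class, the artifact names, and the choice to let the first denoising step sample its own initial noise are all assumptions made for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    inputs: tuple = ()    # artifact names this task consumes
    outputs: tuple = ()   # artifact names this task produces


def build_task_graph(num_steps: int) -> list:
    """Encode -> denoise_T ... denoise_1 -> decode, linked by artifacts."""
    tasks = [Task("encode", inputs=(), outputs=("a_embed",))]
    for t in range(num_steps, 0, -1):
        # Step t consumes the embeddings plus the latent x_t from step t+1.
        # Assumption: the first step samples its own initial noise.
        latent_in = () if t == num_steps else (f"x_{t}",)
        tasks.append(Task(f"denoise_{t}", ("a_embed",) + latent_in, (f"x_{t - 1}",)))
    tasks.append(Task("decode", inputs=("x_0",), outputs=("pixels",)))
    return tasks


def ready_tasks(tasks, available, done):
    """Tasks not yet run whose input artifacts have all been published."""
    return [t for t in tasks
            if t.name not in done and all(a in available for a in t.inputs)]


def execution_order(tasks):
    """Repeatedly run every ready task and publish its outputs."""
    available, done, order = set(), set(), []
    while len(done) < len(tasks):
        ready = ready_tasks(tasks, available, done)
        if not ready:
            raise ValueError("cycle or missing artifact in task graph")
        for task in ready:
            order.append(task.name)
            done.add(task.name)
            available.update(task.outputs)
    return order


print(execution_order(build_task_graph(4)))
```

Because each denoising step depends on the previous step's latent, only one task per request is ready at a time; concurrency comes from interleaving tasks of different requests.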
3.2 Why Trajectory Task Boundaries Are Safe Rescheduling Points
A critical property of trajectory tasks is that completing one produces a semantically complete state that can be safely relocated, resharded, or resumed under a different parallel configuration.
Contrast with LLM serving: in autoregressive LLM inference, each generated token produces a valid continuation — the KV-cache grows token by token and all intermediate activations from past tokens are captured in the cache. Any token boundary is a safe scheduling point. DiT denoising is fundamentally different: within a single denoising step, the model performs bidirectional attention across all tokens simultaneously. An arbitrary mid-step interruption does not yield a semantically valid intermediate state (the latent is in a partially refined state that is not interpretable by the VAE decoder). Therefore, the only valid rescheduling points are at step boundaries.
What makes step boundaries safe: completing denoising step $t$ produces latent $x_{t-1}$, a well-formed tensor that lies on the denoising trajectory. Step $t-1$ takes $x_{t-1}$ as input and is fully independent of the internal state of step $t$ (no hidden states, no ongoing computation). The runtime can:
- Save $x_{t-1}$ to a logical artifact
- Return the GPUs used by step $t$ to the pool
- At a later scheduling decision, allocate a potentially different set of GPUs to step $t-1$
- Resume computation with no correctness loss
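The claim that changing the GPU allocation at a step boundary loses no correctness can be checked on a toy example. The "denoising step" below is a stand-in invented for illustration (each token is pulled toward the global mean, so every shard needs information from all the others, loosely mimicking bidirectional attention); exact `Fraction` arithmetic is used so that results can be compared for equality.

```python
from fractions import Fraction


def shard(x, p):
    """Split a sequence into p contiguous, equal shards (len(x) divisible by p)."""
    n = len(x) // p
    return [x[i * n:(i + 1) * n] for i in range(p)]


def denoise_step(x, p):
    """One toy step executed on p simulated GPUs."""
    shards = shard(x, p)
    partial_sums = [sum(s) for s in shards]     # local work on each GPU
    mean = sum(partial_sums) / len(x)           # stands in for a collective
    updated = [[(v + mean) / 2 for v in s] for s in shards]
    return [v for s in updated for v in s]      # gather back into one latent


def run_trajectory(x0, sp_schedule):
    """Run one step per entry of sp_schedule, resharding at every boundary."""
    x = list(x0)
    for p in sp_schedule:
        x = denoise_step(x, p)
    return x


latent = [Fraction(i) for i in range(8)]
static = run_trajectory(latent, [4, 4, 4, 4])
elastic = run_trajectory(latent, [4, 2, 1, 8])
print(static == elastic)
```

The only state carried across a boundary is the latent itself, so the SP degree used for one step places no constraint on the next.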
3.3 Execution Layouts and the Policy Interface
A scheduling decision in GF-DiT is a function from a ready trajectory task to an execution layout:
$$\tau \mapsto (G, P)$$
where:
- $G$ is the logical execution group: an ordered set of GPU IDs that will execute this task together
- $P$ is the parallel specification, e.g., SP=4 means sequence-parallel across 4 GPUs, CP=2 means context-parallel across 2 GPUs
A policy implements the scheduling function. GF-DiT’s unified policy interface exposes:
```python
from __future__ import annotations  # defer evaluation of the type names below

from typing import Dict, List


class GFDiTPolicy:
    """Base class for pluggable scheduling policies."""

    def schedule(
        self,
        ready_tasks: List[TrajectoryTask],
        gpu_pool: GPUPool,
        system_state: SystemState,
    ) -> Dict[TrajectoryTask, ExecutionLayout]:
        """
        Inputs:
            ready_tasks:  tasks whose artifact dependencies are satisfied
            gpu_pool:     currently idle GPUs
            system_state: queue length, per-request progress, SLO deadlines
        Output:
            mapping from each task to execute to its execution layout
        """
        raise NotImplementedError
```
This clean interface lets GF-DiT support diverse scheduling policies:
| Policy Type | Objective | Scheduling Logic |
|---|---|---|
| Throughput-oriented | Maximize requests/second | Assign minimum viable SP; maximize concurrency |
| Latency-oriented | Minimize mean latency | Assign maximum SP to each request |
| SLO-aware | Minimize SLO violations | Prioritize requests closest to deadline |
| Fair | Equalize per-request service rate | Round-robin GPU allocation across active requests |
| Custom | User-defined | Arbitrary Python logic over system_state |
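As a rough illustration of the first two rows of the table, here are two tiny policies written against a stripped-down version of the interface. The class names, the use of a plain set of GPU IDs as the pool, and the assumption that one GPU is always a viable minimum are simplifications introduced here, not details from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrajectoryTask:          # hypothetical minimal task record
    request_id: str
    stage: str                 # "encode", "denoise", or "decode"


@dataclass(frozen=True)
class ExecutionLayout:         # hypothetical minimal layout record
    group: tuple               # ordered GPU IDs
    sp_degree: int


class ThroughputPolicy:
    """One GPU per ready task, so as many tasks as possible run concurrently."""

    def schedule(self, ready_tasks, gpu_pool, system_state=None):
        free = sorted(gpu_pool)
        # zip stops at the shorter list: tasks without a free GPU simply wait.
        return {task: ExecutionLayout(group=(gpu,), sp_degree=1)
                for task, gpu in zip(ready_tasks, free)}


class LatencyPolicy:
    """Give the first ready task the largest power-of-two group that fits."""

    def schedule(self, ready_tasks, gpu_pool, system_state=None):
        free = sorted(gpu_pool)
        if not ready_tasks or not free:
            return {}
        sp = 1
        while sp * 2 <= len(free):
            sp *= 2
        return {ready_tasks[0]: ExecutionLayout(group=tuple(free[:sp]), sp_degree=sp)}


tasks = [TrajectoryTask("a", "denoise"), TrajectoryTask("b", "encode")]
print(ThroughputPolicy().schedule(tasks, {0, 1, 2, 3}))
print(LatencyPolicy().schedule(tasks, {0, 1, 2, 3}))
```

Both policies return the same kind of mapping, which is what lets the runtime stay unchanged when the objective changes.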
3.4 Predictable Execution Structure
A key property that makes GF-DiT’s scheduling effective is that DiT request execution is largely predictable before it begins.
At admission time, the request specifies:
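- Output resolution: $H \times W$ (image) or $H \times W \times F$ (video with $F$ frames)
- Number of denoising steps: $T$
- Sampler type (DDIM, flow-matching, etc.)
From these, the runtime can:
- Compute the latent sequence length: $L = \frac{H \cdot W \cdot F}{p^2 \cdot c_t}$, where $p$ is patch size and $c_t$ is VAE temporal compression (taking $H$ and $W$ in latent space, and $F = c_t = 1$ for images)
- Enumerate trajectory tasks: exactly $1$ encode task + $T$ denoising tasks + $1$ decode task
- Estimate per-task cost: from a profiling table indexed by (stage, $L$, $P$)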
Formally, define the cost estimator:
$$\hat{c}(\tau, P) = \text{ProfileTable}\left[\text{stage}(\tau),\, L_\tau,\, P\right] \tag{5}$$
where $L_\tau$ is the latent sequence length of task $\tau$ and $P$ is the SP degree. The profiling table is built offline by running each model stage at various input shapes and SP degrees. Because DiT execution is deterministic given $(L_\tau, P)$, this estimate is accurate to within a few percent.
This predictability allows the policy to reason ahead of time about the consequences of different scheduling decisions — a crucial advantage over LLM serving where output length is unknown until generation completes.
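A profiling-table lookup of this kind might look as follows. The table entries are invented numbers, the rounding rule (use the next larger profiled length, which errs on the conservative side) is an assumption, and `latent_seq_len` treats height and width as latent-space dimensions.

```python
import bisect

# Hypothetical profile: (stage, latent length, SP degree) -> seconds per task.
PROFILE = {
    ("denoise", 1024, 1): 0.040, ("denoise", 1024, 2): 0.026,
    ("denoise", 4096, 1): 0.400, ("denoise", 4096, 2): 0.210,
    ("denoise", 4096, 4): 0.115,
}


def latent_seq_len(height, width, frames=1, patch=2, temporal_compression=1):
    """Token count after patchifying a latent of height x width x frames."""
    latent_frames = max(1, frames // temporal_compression)
    return (height // patch) * (width // patch) * latent_frames


def estimate_cost(stage, seq_len, sp, table=PROFILE):
    """Cost of the smallest profiled length >= seq_len for this stage and SP."""
    lengths = sorted({l for (s, l, p) in table if s == stage and p == sp})
    idx = bisect.bisect_left(lengths, seq_len)
    if idx == len(lengths):
        raise KeyError(f"no profile for {stage} at L={seq_len}, SP={sp}")
    return table[(stage, lengths[idx], sp)]


def trajectory_cost(seq_len, steps, sp):
    """Total denoising time if every step runs at the same SP degree."""
    return steps * estimate_cost("denoise", seq_len, sp)


L = latent_seq_len(128, 128)
print(L, estimate_cost("denoise", L, 4), trajectory_cost(L, 50, 4))
```

Since the number of steps and the shape are fixed at admission, the whole remaining cost of a request is known before its first task runs.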
Algorithm 1: GF-DiT Scheduling Loop (pseudocode)
Algorithm 1: GF-DiT Main Scheduling Loop
─────────────────────────────────────────────────────
1: Initialize: gpu_pool ← all GPUs, task_queues ← {}
2: while serving_active do
3: # Admission
4: for each new_request r in arrival_queue do
5: G_r ← build_trajectory_task_graph(r)
6: task_queues.add(G_r)
7: end for
8:
9: # Identify ready tasks (dependencies satisfied)
10: ready ← {τ ∈ task_queues : all artifacts of τ are available}
11:
12: # Policy decision
13: decisions ← policy.schedule(ready, gpu_pool, system_state)
14:
15: # Dispatch scheduled tasks
16: for each (τ, layout) in decisions do
17: gpu_pool.reserve(layout.group)
18: async_execute(τ, layout) # non-blocking; GPU workers execute
19: end for
20:
21: # Handle completions (event-driven)
22: for each completed task τ_done do
23: gpu_pool.release(τ_done.group)
24: publish_artifacts(τ_done) # makes outputs available
25: if τ_done.request.is_complete() then
26: deliver_response(τ_done.request)
27: end if
28: end for
29: end while
The loop is asynchronous: task dispatch (line 18) is non-blocking, and completions are event-driven. This allows the scheduler to dispatch multiple tasks to different GPU groups simultaneously and react to completions as they arrive.
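The control flow of Algorithm 1 can be exercised in a few lines with a discrete-event sketch. This is a simplification under stated assumptions: simulated time instead of real GPUs, ideal linear speedup (`cost / sp`), one completion handled per iteration, and a hypothetical `choose_sp` rule standing in for the policy.

```python
import heapq


def choose_sp(stage, free_count):
    """Hypothetical policy: per-stage target SP, capped by the free GPUs."""
    want = {"encode": 1, "denoise": 4, "decode": 2}[stage]
    sp = 1
    while sp * 2 <= min(want, free_count):
        sp *= 2
    return sp if free_count >= 1 else 0


def run_loop(requests, num_gpus):
    """requests: {name: [(stage, single-GPU cost in seconds), ...]}."""
    free = list(range(num_gpus))
    next_idx = {name: 0 for name in requests}
    running, finished_at, events, now = set(), {}, [], 0.0
    while len(finished_at) < len(requests):
        # Dispatch: each request has at most one ready task (its next one).
        for name, tasks in requests.items():
            if name in running or name in finished_at:
                continue
            stage, cost = tasks[next_idx[name]]
            sp = choose_sp(stage, len(free))
            if sp == 0 or sp > len(free):
                continue                      # wait for GPUs to be released
            gpus, free = free[:sp], free[sp:]
            running.add(name)
            heapq.heappush(events, (now + cost / sp, name, gpus))
        # Completion: jump to the next finishing task and release its GPUs.
        now, name, gpus = heapq.heappop(events)
        free.extend(gpus)
        running.discard(name)
        next_idx[name] += 1
        if next_idx[name] == len(requests[name]):
            finished_at[name] = now
    return finished_at


requests = {
    "video": [("encode", 1.0), ("denoise", 8.0), ("denoise", 8.0), ("decode", 2.0)],
    "image": [("encode", 1.0), ("denoise", 2.0), ("decode", 1.0)],
}
print(run_loop(requests, num_gpus=4))
```

In this toy trace the image request finishes while the video request is still on its first denoising step, and the video then widens to all four GPUs for its second step once they are free.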
4. Group-Free Collectives: Eliminating Communicator Overhead
This is GF-DiT’s most technically novel contribution. Standard NCCL requires a communicator (ncclComm_t) to be initialized before any collective can execute. Communicator creation involves topology discovery, buffer allocation, and connection setup — taking up to 778 ms in the paper’s measurements. This latency is prohibitive for elastic parallelism where groups change per task.
4.1 Root Cause Analysis: What Makes NCCL Communicator Setup Slow?
NCCL communicator initialization (ncclCommInitRank) performs the following steps:
Step 1 — Bootstrap: all ranks exchange their IP addresses and port numbers via an out-of-band bootstrap server. Each rank makes TCP connections to rendezvous.
Step 2 — Topology detection: NCCL discovers the hardware topology — which GPUs are connected by NVLink, which share a PCIe switch, which are on the same NUMA node. This involves querying the OS and the CUDA driver.
Step 3 — Transport selection: NCCL decides whether to use NVLink (fast, lower latency) or PCIe/network (slower) for each pair of ranks based on topology.
Step 4 — Ring/tree construction: NCCL builds the optimal ring (for AllReduce) or tree (for Broadcast/Reduce) topology for collective operations across the selected ranks.
Step 5 — Buffer allocation: each rank allocates pinned host memory and device memory for communication buffers.
In a static serving system, these steps happen once at startup and the overhead is amortized over thousands of requests. In an elastic system where the execution group changes every denoising step, the communicator setup cost applies to every group transition — making it the dominant bottleneck.
4.2 Group-Free Collectives: Design
GF-DiT’s insight is that for serving workloads, the topology is already known and the transport choices are fixed. The system starts up on a fixed set of GPUs with a fixed network topology. Group-free collectives exploit this: instead of building a new communicator from scratch, they use logical group descriptors that reference pre-established point-to-point channels.
Definition (Logical Group Descriptor): A logical group descriptor specifies:
- : the set of GPU IDs participating in this collective
- : the parallel specification (what SP/TP degree)
- : how the data tensor is currently sharded across the GPUs in
Instead of allocating new communication buffers and building transport trees, the collective implementation uses a pre-allocated communication pool that covers all GPU pairs in the system. Any subset of these channels can be selected at runtime using the logical descriptor.
Algorithm 2: Group-Free AllGather
Algorithm 2: Group-Free AllGather(tensor x, descriptor D)
─────────────────────────────────────────────────────────
Input: x = locally held shard (size L/P × d)
D = {G = {g₀,...,g_{P-1}}, P, layout="sequence-parallel"}
Output: y = full tensor (size L × d) on all ranks in G
1: rank ← local_rank_in(D.G) # O(1) lookup, no network call
2: y ← allocate_output(L × d)
3: # Copy local shard into position
4: y[rank*(L/P) : (rank+1)*(L/P), :] ← x
5:
6: # Exchange with other ranks using pre-established channels
7: for step in 1 .. P-1 do
8: send_rank ← (rank + step) mod P # ring-step partner
9: recv_rank ← (rank - step) mod P
10: src_slice ← (recv_rank*(L/P), (recv_rank+1)*(L/P)) # slot for the shard arriving from recv_rank
11: # Use pre-allocated NVLink / PCIe buffer for (rank, send_rank) pair
12: async_send(channel[rank, send_rank], x, tag=step)
13: async_recv(channel[recv_rank, rank], y[src_slice], tag=step)
14: end for
15: wait_all() # barrier within D.G only
16: return y
Key difference from NCCL: line 1 (local_rank_in(D.G)) is an O(1) dictionary lookup with no network synchronization. Lines 12–13 use pre-established point-to-point channels from the global communication pool, so there is no group initialization phase. The barrier at line 15 is per-group (only the $P$ ranks in D.G synchronize) and uses lightweight CUDA events rather than NCCL synchronization primitives.
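The idea of selecting a subset of pre-established channels at call time can be simulated in a single process. This sketch models channels as in-memory queues and is only meant to show the bookkeeping; it says nothing about how GF-DiT implements the transport, and `CHANNELS` and `group_free_all_gather` are names made up for this example.

```python
from collections import deque

NUM_GPUS = 8
# Created once at startup for every ordered GPU pair, before any request arrives.
CHANNELS = {(src, dst): deque()
            for src in range(NUM_GPUS) for dst in range(NUM_GPUS) if src != dst}


def group_free_all_gather(shards, group):
    """shards: {gpu_id: local shard (list)}; group: ordered tuple of GPU IDs.

    Returns {gpu_id: full tensor}, concatenated in group-rank order.
    No per-group setup happens here: the group is just an argument.
    """
    p = len(group)
    # Send phase: every member pushes its shard to each other member.
    for rank, gpu in enumerate(group):
        for step in range(1, p):
            peer = group[(rank + step) % p]
            CHANNELS[(gpu, peer)].append(shards[gpu])
    # Receive phase: every member fills the slot of the rank it received from.
    result = {}
    for rank, gpu in enumerate(group):
        pieces = [None] * p
        pieces[rank] = shards[gpu]
        for step in range(1, p):
            src_rank = (rank - step) % p
            pieces[src_rank] = CHANNELS[(group[src_rank], gpu)].popleft()
        result[gpu] = [v for piece in pieces for v in piece]
    return result


# Two back-to-back collectives over different groups, with no setup in between.
first = group_free_all_gather({2: [0, 1], 5: [2, 3], 7: [4, 5], 0: [6, 7]}, (2, 5, 7, 0))
second = group_free_all_gather({5: [10], 2: [20]}, (5, 2))
print(first[7], second[2])
```

Note that each received shard is written to the slot of the rank it came from, which is the indexing that the receive line of Algorithm 2 needs to get right.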
4.3 Latency Comparison
The measured setup overhead:
| Method | Group Formation Time | Per-Collective Overhead |
|---|---|---|
| NCCL ncclCommInitRank | ~778 ms | ~0 (amortized) |
| GF-DiT group-free collectives | ~60 μs | ~5–10 μs (descriptor lookup) |
| Speedup | ~13,000× | — |
The 13,000× reduction in group formation overhead is what unlocks elastic parallelism. If a denoising step takes 100 ms on 2 GPUs, the overhead of forming a new group is now 0.06% of the step time rather than 778% — truly negligible.
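The arithmetic behind these figures is short enough to reproduce directly. The two setup times are the ones quoted in this review; the 100 ms step time and the 50 regroupings per request are the illustrative values used in the text.

```python
NCCL_SETUP_S = 0.778          # communicator construction, as quoted above
GROUP_FREE_SETUP_S = 60e-6    # group-free group formation, as quoted above


def overhead_fraction(setup_s: float, step_s: float) -> float:
    """Group-formation time relative to one denoising step."""
    return setup_s / step_s


def regroup_cost_s(setup_s: float, num_regroups: int) -> float:
    """Total setup time if the group changes `num_regroups` times."""
    return setup_s * num_regroups


speedup = NCCL_SETUP_S / GROUP_FREE_SETUP_S
print(f"speedup: {speedup:,.0f}x")
print(f"NCCL overhead per 100 ms step: {overhead_fraction(NCCL_SETUP_S, 0.1):.0%}")
print(f"group-free overhead per 100 ms step: {overhead_fraction(GROUP_FREE_SETUP_S, 0.1):.2%}")
print(f"50 regroupings: {regroup_cost_s(NCCL_SETUP_S, 50):.1f} s vs "
      f"{regroup_cost_s(GROUP_FREE_SETUP_S, 50) * 1e3:.0f} ms")
```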
4.4 Layout-Aware Artifact Migration
When the scheduled SP degree changes between two consecutive trajectory tasks, the artifact (latent tensor) must be resharded to match the new layout.
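Example: task $\tau_t$ executes with SP=4 (tensor distributed as 4 shards across GPUs 0–3). Task $\tau_{t-1}$ executes with SP=2 (GPUs 0–1). The latent artifact $x_{t-1}$ produced by $\tau_t$ must be gathered from 4 shards and then re-split into 2 shards before $\tau_{t-1}$ can execute.
GF-DiT automates this with layout-aware artifact migration:
$$\text{migrate}\left(a,\ \text{layout}_{\text{src}} \rightarrow \text{layout}_{\text{dst}}\right)$$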
The migration is implemented as a sequence of group-free collectives (AllGather to undo the source sharding, then ReduceScatter or slice to apply the target sharding). Importantly, the migration uses the same group-free collective infrastructure — no additional communicator overhead.
Migration cost: for a latent tensor of typical size (e.g., 1 GB for a video), the migration between SP=4 and SP=2 requires moving ~0.5 GB per GPU pair via NVLink (~1 ms at NVLink bandwidth of 600 GB/s). This is much smaller than the task execution time, so it can be overlapped with the policy scheduling decision for the next task.
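A gather-then-slice migration and the bandwidth estimate can both be sketched in a few lines. The `reshard` helper works on plain lists and ignores where the shards physically live; it and `transfer_time_s` are hypothetical names, and the bandwidth figure is the 600 GB/s quoted above.

```python
def reshard(shards, new_p):
    """Gather the source shards, then split into new_p equal target shards."""
    full = [v for s in shards for v in s]            # undo the source sharding
    if len(full) % new_p:
        raise ValueError("sequence length must be divisible by the target SP")
    n = len(full) // new_p
    return [full[i * n:(i + 1) * n] for i in range(new_p)]   # apply the target


def transfer_time_s(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on transfer time, ignoring latency and contention."""
    return num_bytes / bandwidth_bytes_per_s


sp4 = [[0, 1], [2, 3], [4, 5], [6, 7]]
sp2 = reshard(sp4, 2)
print(sp2)
print(f"{transfer_time_s(0.5e9, 600e9) * 1e3:.2f} ms for 0.5 GB at 600 GB/s")
```

The bandwidth-only figure is a floor; measured migration times will sit above it once latency and synchronization are included.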
5. Runtime Implementation
5.1 Asynchronous Execution Model
GF-DiT’s runtime implements a fully asynchronous execution model. The key data structures are:
ArtifactStore:
{artifact_id → (tensor_data, layout_descriptor, gpu_locations)}
TaskQueue:
{task_id → TrajectoryTask}
PendingCompletions:
{future → (task_id, completion_callback)}
The execution model is event-driven: tasks are dispatched asynchronously, and completions are handled via callbacks. The scheduling loop (Algorithm 1) runs on the CPU and is non-blocking — it never waits for a GPU task to complete before making the next scheduling decision.
This allows the system to maintain multiple tasks in flight simultaneously across different execution groups. For example:
- GPUs 0–3: executing denoising step 40 of a video request (SP=4)
- GPUs 4–5: executing text encoding of a new image request (SP=2)
- GPU 6: executing VAE decoding for a just-completed image request (SP=1)
- GPU 7: idle (available for the next scheduling decision)
graph TB
subgraph "GPUs 0-3 SP4"
VA["Video Req A: Denoise step 30 (200ms)"]
end
subgraph "GPUs 4-5 SP2"
IB["Image Req B: Encode (50ms) -> Denoise (150ms)"]
end
subgraph "GPU 6 SP1"
DC["Req C: VAE Decode (80ms) -> Idle -> New Encode"]
end
subgraph "GPU 7 SP1"
ND["Req D: Encode (50ms) -> Denoise"]
end
Figure 3: GF-DiT elastic concurrent execution across 8 GPUs. At the same instant, GPU group 0-3 (SP=4) processes a heavy video denoising step; GPUs 4-5 (SP=2) serve a medium image request; GPU 6 and 7 independently serve lightweight requests. This is impossible with static parallelism, which locks all 8 GPUs to a single configuration.
5.2 Simulation-Driven Policy Optimization
GF-DiT provides a simulation environment for evaluating and tuning scheduling policies before deployment. The simulator uses the cost estimator (Equation 5) to predict task execution times and replay request traces.
Why simulation is effective for DiTs: because DiT execution is predictable (Section 3.4), the simulator is accurate. This is in contrast to LLM serving simulators, which must model unpredictable output lengths.
The simulation workflow:
- Collect a representative request trace (workload shape distribution)
- Run the simulator with candidate policies using the profiling table
- Measure simulated throughput, latency CDF, SLO violation rate
- Select or tune the best policy before deploying to production GPUs
This is a significant practical advantage: policy development and iteration can happen offline without monopolizing GPU hardware.
5.3 SLO-Aware Scheduling Policy (Example)
To illustrate the policy interface, here is a simplified SLO-aware scheduler:
Algorithm 3: SLO-Aware Scheduling Policy
Algorithm 3: SLO-Aware Schedule
─────────────────────────────────────────────────────
Input: ready_tasks T, gpu_pool P, state S
Output: assignment: TrajectoryTask → ExecutionLayout
1: assignments ← {}
2: # Sort ready tasks by urgency (time-to-deadline)
3: sorted_tasks ← sort(T, key=lambda τ: deadline(τ.request) - now())
4:
5: for τ in sorted_tasks do
6: remaining_work ← Σ_{τ' after τ} ĉ(τ', 1) # min-parallelism estimate
7: time_budget ← deadline(τ.request) - now() - remaining_work
8:
9: # Linear scan in ascending order for the minimum SP that meets the deadline
10: best_SP ← 1
11: for P_candidate in [1, 2, 4, 8] do
12: if ĉ(τ, P_candidate) ≤ time_budget / remaining_steps(τ) then
13: best_SP ← P_candidate # use minimum sufficient parallelism
14: break # conserve GPUs for other requests
15: end if
16: end for
17:
18: # Allocate GPUs for best_SP if available
19: if gpu_pool.available() ≥ best_SP then
20: G ← gpu_pool.allocate(best_SP)
21: assignments[τ] ← ExecutionLayout(G, best_SP)
22: end if
23: end for
24: return assignments
The SLO-aware policy allocates the minimum SP degree that satisfies the deadline for each task, freeing excess GPUs for other requests. Under heavy load, this improves overall SLO attainment compared to a latency-minimizing policy (which always allocates maximum SP and leaves little concurrency).
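The core of this policy, picking the smallest SP degree whose estimated step time fits the per-step budget, fits in one function. The cost function below is a made-up stand-in for the profiling table, and the fallback to the largest candidate when nothing fits is an assumption of this sketch rather than something Algorithm 3 specifies.

```python
def min_sufficient_sp(cost_fn, per_step_budget_s, candidates=(1, 2, 4, 8)):
    """Smallest SP degree whose estimated step time fits the budget.

    Assumption: if no candidate fits, fall back to the largest one so that
    a request already at risk is at least given the most parallelism.
    """
    for sp in sorted(candidates):
        if cost_fn(sp) <= per_step_budget_s:
            return sp
    return max(candidates)


def order_by_urgency(deadlines, now):
    """Request names sorted by time remaining until their deadline."""
    return sorted(deadlines, key=lambda name: deadlines[name] - now)


def toy_cost(sp):
    """Hypothetical step time: ideal speedup plus a per-GPU comm penalty."""
    return 0.4 / sp + 0.01 * (sp - 1)


for budget in (0.5, 0.25, 0.15, 0.05):
    print(budget, min_sufficient_sp(toy_cost, budget))
print(order_by_urgency({"video": 30.0, "image": 2.0}, now=1.0))
```

Stopping at the first sufficient degree is what keeps GPUs free for other requests; a latency-oriented policy would instead take the degree with the lowest estimated cost.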
6. Experimental Evaluation
6.1 Setup
GF-DiT is implemented in vLLM-Omni, an LLM-style serving framework extended to support diffusion models. The authors evaluate on representative workloads:
Models evaluated:
- Wan 2.1: a large T2V model (video generation)
- HunyuanVideo: another production T2V model
- FLUX: a T2I model (image generation)
Hardware: multi-GPU servers with NVLink-connected A100/H100 GPUs
Baselines:
- Static SP4: fixed 4-GPU sequence parallelism for all requests (existing approach)
- Static SP1: single-GPU execution for all requests
- Oracle: hypothetical optimal policy with perfect future knowledge
Workloads:
- Mixed video+image requests (heterogeneous resolution)
- Poisson arrival process with varying load (λ requests/second)
- SLO targets: P95 latency < K × minimum-request-latency
6.2 Main Results
Throughput: GF-DiT achieves up to 6.01× throughput improvement over static SP4 on mixed workloads. The gain comes from two sources: (1) small requests no longer wait behind large requests (reduced HoL blocking), and (2) GPUs freed from over-provisioned stages can serve additional concurrent requests.
Latency: mean latency reduction up to 95% over static SP4. This is primarily driven by eliminating HoL blocking — short image requests that previously waited minutes for a video request to complete now execute on free GPUs within milliseconds.
SLO attainment: SLO violation rate reduced by up to 90%. The SLO-aware policy (Algorithm 3) is particularly effective: it identifies requests at risk of deadline violation early and allocates more GPUs proactively.
Communication overhead: group-free collectives reduce communication-group setup from 778 ms to ~60 μs. Across a typical request with T=50 denoising steps and potential SP changes at each step, this saves up to 50 × 778 ms ≈ 39 seconds of overhead that would otherwise be incurred in a naive elastic implementation using standard NCCL.
pie title Throughput Share by Strategy (Normalized, SP4 baseline = 1.0x)
"Static SP1 (0.5x)" : 0.5
"Static SP4 baseline (1.0x)" : 1.0
"GF-DiT Throughput (6.01x)" : 6.01
"GF-DiT SLO-Aware (4.8x)" : 4.8
Figure 4: Throughput across scheduling strategies, normalized to the static SP4 baseline (illustrative values from paper results). GF-DiT’s throughput-oriented policy achieves 6.01x over the static SP4 baseline. Normalized throughputs are: SP1 (0.5x), SP4 (1.0x), GF-DiT Throughput (6.01x), GF-DiT SLO-Aware (4.8x). The SLO-aware policy trades some throughput for better deadline compliance.
6.3 Stage-Level Heterogeneity Measurements
The paper’s motivating measurements (Section 2 of the paper) provide insight into why elastic parallelism is so valuable:
Encoding stage (text encoder): execution time does not improve as SP degree increases from 1 to 8. The computation is so lightweight that inter-GPU communication dominates, so adding GPUs at best leaves encoding time flat and at worst makes it slower. Optimal SP = 1.
Denoising stage (DiT model): for large video requests (L = 4096 tokens), latency decreases approximately linearly from SP=1 to SP=4, then sub-linearly to SP=8. For small image requests (L = 256), latency increases beyond SP=2 due to communication overhead. Optimal SP is request-shape-dependent.
VAE decode stage: moderate scaling, optimal at SP=2–4 depending on resolution.
A static SP=4 configuration over-provisions encoding (wastes 3 GPUs), correctly provisions denoising for large video (but over-provisions for small images), and matches VAE decoding only when the output resolution favors SP=4 rather than SP=2.
GF-DiT handles all of these cases dynamically — each stage gets exactly the SP degree that minimizes its latency or conserves GPUs for other requests, depending on the policy.
6.4 Layout-Aware Artifact Migration Overhead
The paper measures the cost of migrating latent artifacts between SP configurations. For a typical video latent tensor (~500 MB–1 GB), migration takes 2–8 ms, which is small compared to denoising step latencies of 50–500 ms. The migration cost is further hidden by overlapping it with the scheduling decision computation on the CPU, resulting in near-zero net overhead in most configurations.
7. Comparison to Prior Work
| System | Parallelism | Scheduling | Group Reconfiguration | DiT-Specific |
|---|---|---|---|---|
| vLLM-Omni (static) | Fixed SP | Request-level | N/A | Yes |
| SGLang Diffusion | Fixed SP | Request-level | N/A | Yes |
| Alpa | Static auto-parallel | Training only | N/A | No |
| Llumnix | Migratable requests | Instance-level | Live request migration | LLM only |
| GF-DiT | Elastic SP per task | Task-level | 60 μs group-free | Yes |
Versus Alpa: Alpa (OSDI 2022) automates parallelization strategy selection for training but produces a static plan — it does not adapt at runtime. Moreover, it targets training, not serving.
Versus Llumnix: Llumnix (OSDI 2024; we reviewed it earlier) enables live migration of LLM requests across serving instances. GF-DiT is complementary — it operates within a single multi-GPU deployment and adapts parallelism at task granularity, not instance granularity. Llumnix migrates requests; GF-DiT adapts how requests are executed.
Versus disaggregated prefill-decode systems: systems like DistServe (OSDI 2024) disaggregate LLM prefill and decode to separate GPU pools. DiT serving has no equivalent prefill/decode distinction (every denoising step is “decode-like”), so the disaggregation design does not directly apply. GF-DiT’s elastic scheduling is a more fine-grained solution for DiT workloads.
8. Critical Assessment: Weaknesses & Improvements
8.1 Weaknesses and Flaws
(a) Limited baseline comparison: the paper compares against two static-parallelism baselines (SP1 and SP4) but does not compare against more sophisticated alternatives:
- Preemption-based approaches: the paper claims existing systems cannot preempt requests, but preemption could be implemented at stage boundaries using checkpoint-restore without requiring elastic parallelism. The paper does not experimentally evaluate this strawman, making it hard to assess how much of the gain comes from elastic parallelism versus simply adaptive scheduling.
- Static multi-tier parallelism: a simple heuristic of using SP=1 for encoding, SP=4 for denoising, and SP=2 for decoding (fixed per stage but not per request) would capture much of the stage-level heterogeneity benefit identified in Section 2.3(a). The paper does not report this as a baseline, which is a notable omission given that it would be straightforward to implement in existing systems.
- The 6.01× gain: this headline number is achieved on a specific mixed workload where static SP4 is particularly poorly matched. The paper does not report results on homogeneous workloads (all video requests, all image requests) or show how gains vary across workload distributions, making it hard to assess robustness.
(b) Evaluation hardware scope: the experiments are conducted on NVIDIA GPU clusters, and group-free collectives are described in terms of NCCL’s model. The paper does not evaluate on AMD GPUs (RCCL) or custom interconnects (e.g., Google TPU ICI, AWS NeuronLink), where the group formation overhead might differ significantly from 778 ms. The claimed 13,000× speedup in group formation is hardware-specific and may not generalize.
(c) Migration cost underreported: the paper reports “2–8 ms” for artifact migration but does not provide a breakdown of migration cost vs. denoising step time across the full range of models and resolutions evaluated. For small models at low resolution (short denoising steps of ~10 ms), migration overhead could be non-negligible. The paper does not report this regime.
(d) Policy optimality gap: the SLO-aware policy (Algorithm 3) is a greedy heuristic. The paper does not characterize the gap between GF-DiT’s policies and the Oracle bound (mentioned in Section 6.1) quantitatively. If the gap is large, there may be significant room for improvement that the paper leaves unexplored.
8.2 Limitations the Authors Understate
(a) Communicator prewarming assumption: group-free collectives assume that point-to-point communication channels between all GPU pairs are pre-established at system startup. The paper acknowledges this but does not quantify the startup cost of establishing O(N²) point-to-point channels for a large GPU cluster (e.g., 256 GPUs). At scale, this could dominate startup time and the memory footprint of the pre-allocated communication buffers could be substantial.
(b) Policy stability under bursty arrivals: the policy interface is evaluated under Poisson arrivals, which have relatively low variance in inter-arrival times. Real production DiT workloads exhibit bursty patterns (e.g., social media campaigns, gaming events). The paper does not characterize GF-DiT’s behavior under burst arrivals where many large requests arrive simultaneously, which is precisely when elastic parallelism is most needed and potentially most difficult to schedule optimally.
(c) Model heterogeneity: GF-DiT is evaluated on models from the same family (all large T2V/T2I DiTs with similar architectures). A production serving system might host multiple different DiT models simultaneously. The paper does not evaluate the overhead of switching the GF-DiT runtime between models with different weight layouts or different sharding strategies.
(d) Fault tolerance: the paper does not address GPU failure scenarios. If one GPU in an execution group fails mid-task, the GF-DiT runtime has no mechanism to recover the request’s artifact state and reassign it to a different group. This is a significant operational concern for long-running video generation jobs.
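To put the O(N²) concern from point (a) in numbers, the sketch below counts the channels in a full point-to-point mesh and estimates per-GPU buffer memory. The one-buffer-per-peer layout and the 4 MiB buffer size are assumptions chosen for illustration; the paper does not state either.

```python
def p2p_channel_count(num_gpus: int) -> int:
    """Number of unordered GPU pairs, i.e. channels in a full point-to-point mesh."""
    return num_gpus * (num_gpus - 1) // 2

def buffer_footprint_gib(num_gpus: int, bytes_per_peer: int) -> float:
    """Per-GPU buffer memory in GiB, assuming each GPU keeps one buffer per peer."""
    return (num_gpus - 1) * bytes_per_peer / 2**30

ASSUMED_BUFFER_BYTES = 4 * 2**20  # hypothetical 4 MiB per peer, not from the paper

for n in (8, 64, 256):
    print(n, p2p_channel_count(n), round(buffer_footprint_gib(n, ASSUMED_BUFFER_BYTES), 2))
```

Going from 8 to 256 GPUs raises the pair count from 28 to 32,640, so even a modest per-channel setup cost would add up at startup.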
8.3 Concrete Improvement Suggestions
(a) Add a fixed-per-stage baseline: the most natural ablation is to fix the SP degree per stage (SP=1 for encode, SP=4 for denoising, SP=2 for decode) without dynamic per-task adaptation. This baseline would reveal how much of GF-DiT’s gain comes from stage-level heterogeneity versus the more fine-grained per-task adaptation. If the fixed-per-stage baseline captures 70–80% of the gain, the argument for full elastic parallelism is weakened; if it captures <30%, the fine-grained adaptation is clearly necessary.
(b) Characterize the policy-Oracle gap: run the Oracle policy (using offline knowledge of future arrivals) and report the gap to GF-DiT’s policies across multiple workload distributions. This would (i) bound the potential improvement from better policies and (ii) validate whether GF-DiT’s policies are near-optimal or have room for improvement.
(c) Extend group-free collectives to heterogeneous hardware: evaluate group-free collectives on AMD ROCm/RCCL and ensure the implementation does not rely on NCCL-specific internals. Given the growing adoption of AMD GPUs in AI infrastructure, this would significantly broaden the paper’s applicability.
(d) Add preemption as a baseline and mechanism: implement request preemption at stage boundaries as a separate mechanism and compare against GF-DiT’s elastic parallelism. Preemption would allow a long video request to yield its GPUs between stages while a short image request executes, without requiring group reconfiguration. The comparison would clarify which design choice (preemption vs. elastic SP) is more effective and whether combining them is beneficial.
(e) Failure handling: extend the artifact store to support checkpointing trajectory task states to host memory or storage after each task boundary. This would allow interrupted requests to resume after GPU failure or preemption at the cost of moderate migration overhead.
9. Broader Impact and Future Directions
GF-DiT’s core abstraction — treating parallelism as a schedulable resource rather than a static deployment decision — has implications beyond DiT serving. Several extensions are worth considering:
LLM serving with speculative decoding: speculative decoding using small draft models and large verification models alternates between two execution configurations per step. A GF-DiT-style elastic scheduler could adaptively allocate GPUs to the draft and verify stages based on acceptance rate patterns, reducing idle time in the verification stage.
Multimodal serving: production systems serve mixtures of text, image, video, and audio requests with radically different compute profiles. GF-DiT’s framework could be extended to a general multi-modal serving system where GPU allocation adapts across modality boundaries.
Training with elastic checkpointing: the trajectory task abstraction (where each step produces a transferable state) is analogous to micro-step checkpointing in training. A training system that supports checkpoint-and-resume at each micro-step could use GF-DiT’s group-free collectives to support elastic GPU allocation during training — for example, temporarily taking GPUs offline for maintenance without interrupting training.
10. Conclusion
GF-DiT makes a compelling case that GPU parallelism is a schedulable resource, not a static deployment parameter. Its two central innovations — reschedulable trajectory tasks and group-free collectives — together solve the fundamental obstacles to elastic DiT serving: finding safe preemption points and making group reconfiguration cheap enough to use per task.
The empirical results (6.01× throughput, 95% latency reduction, 90% SLO violation reduction) are impressive, though as discussed in Section 8, the evaluation scope could be broader and several important baselines are absent. The group-free collectives technique, which cuts group formation overhead from 778 ms to 60 μs, is the most technically novel contribution and the one most likely to have lasting impact — it is a genuinely useful primitive for any system that needs to dynamically compose GPU communication groups.
For practitioners building generative AI serving systems: the main takeaway is that static parallelism configurations chosen at deployment time will increasingly be a bottleneck as workload heterogeneity grows. The trajectory task abstraction and policy interface that GF-DiT provides offer a principled path toward dynamic parallelism management. The key remaining challenge is developing robust policies that adapt well under bursty, unpredictable real-world traffic — which is where most of the open research opportunity lies.
Appendix A: DDIM and Flow Matching Samplers — Why T is Small in Production
Modern production DiT models do not use the original DDPM sampler (which requires $T \approx 1000$ steps). Two families of accelerated samplers are ubiquitous in deployed systems:
A.1 DDIM: Deterministic Implicit Sampling
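DDIM (Song et al., 2021) rewrites the reverse process as a non-Markovian process that becomes a deterministic, ODE-like update when its noise term is switched off. The update rule is:
$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}} + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t, t) + \sigma_t\,\epsilon_t$$
Here $\bar\alpha_t$ is the cumulative noise schedule, $\epsilon_\theta$ is the DiT's noise prediction, $\epsilon_t$ is fresh Gaussian noise, and $\sigma_t$ is the noise scale.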
Setting the noise scale $\sigma_t = 0$ makes DDIM fully deterministic. Because the sampler skips intermediate timesteps, $T$ can be reduced to 20–50 without significant quality loss. Each step still requires one forward pass through the full DiT model, so compute per step is unchanged, but the number of steps is drastically reduced.
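A minimal pure-Python sketch of one DDIM update may help here. It follows the standard DDIM formulation with a cumulative noise schedule; the function name `ddim_step`, the parameter names, and the list-of-floats representation are assumptions made for illustration, and a real sampler would operate on tensors with a trained model supplying `eps_pred`.

```python
import math

def ddim_step(x_t, eps_pred, abar_t, abar_prev, sigma=0.0, noise=None):
    """One DDIM update on a list of floats.

    abar_t and abar_prev are the cumulative noise-schedule values at the current
    and the previous (less noisy) timestep; eps_pred is the predicted noise.
    With sigma=0 the update is deterministic.
    """
    if noise is None:
        noise = [0.0] * len(x_t)
    out = []
    for x, e, z in zip(x_t, eps_pred, noise):
        # Estimate the clean sample implied by the current noise prediction.
        x0_pred = (x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
        # Re-noise toward the previous timestep along the predicted direction.
        direction = math.sqrt(1.0 - abar_prev - sigma**2) * e
        out.append(math.sqrt(abar_prev) * x0_pred + direction + sigma * z)
    return out

# Sanity check with a perfect noise predictor: build x_t from known x0 and eps,
# then jump straight to abar_prev = 1 (no noise left) and compare with x0.
x0, eps = [0.5, -1.0], [0.3, 0.7]
abar_t = 0.25
x_t = [math.sqrt(abar_t) * a + math.sqrt(1 - abar_t) * b for a, b in zip(x0, eps)]
recovered = ddim_step(x_t, eps, abar_t, abar_prev=1.0)
print(recovered)
```

The check illustrates why timesteps can be skipped: when the noise prediction is accurate, a single large jump lands on the same clean sample as many small ones would.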
Implication for GF-DiT: with $T = 20$, a request has 22 trajectory tasks total (1 encode + 20 denoise + 1 decode). The scheduling loop executes at most 22 times per request, which is enough to meaningfully adapt parallelism at each denoising step.
A.2 Flow Matching: Continuous-Time Straight Trajectories
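Flow matching (Lipman et al., 2022; Liu et al., 2022) defines a vector field $v_\theta(x_t, t)$ that maps noise to data along straight (or near-straight) trajectories in latent space. The model is trained to predict the velocity field such that:
$$\frac{dx_t}{dt} = v_\theta(x_t, t), \qquad x_t = (1 - t)\,x_0 + t\,x_1$$
where $x_0$ is a noise sample and $x_1$ is a data sample, so the regression target along the straight path is $x_1 - x_0$.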
The key advantage: because trajectories are straight, fewer steps (even 1–8 in some distilled models) are needed to integrate from noise to data. Several production models (FLUX, Stable Diffusion 3, Wan 2.1) use flow matching.
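The link between straight trajectories and low step counts can be seen with a toy Euler integrator. The sketch below is an idealization: `euler_integrate` is a hypothetical helper, time runs from noise at t=0 to data at t=1, and the constant velocity stands in for what a perfectly trained model would predict for a single noise/data pair. Real models have curved trajectories and need more steps.

```python
def euler_integrate(velocity, x_start, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x_start)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Idealized straight-line flow: the velocity is the constant (data - noise).
noise, data = [1.0, -2.0], [0.25, 0.5]

def straight(x, t):
    return [d - n for d, n in zip(data, noise)]

for steps in (1, 4, 8):
    print(steps, euler_integrate(straight, noise, steps))
```

With a constant velocity every step count reaches the same endpoint, so additional steps add compute without improving the result. That is the property distilled few-step models try to approach.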
Implication for GF-DiT: flow matching models with 8-step sampling have only 10 trajectory tasks — very few scheduling opportunities. GF-DiT is most beneficial for models that use 20–50+ denoising steps (DDIM) or for multi-step flow matching. This is an important practical limitation: for aggressively distilled models with 1–4 steps, the trajectory task graph is so short that elastic scheduling provides minimal benefit.
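The task counts above, and what they imply for reconfiguration cost, can be tabulated directly. The sketch assumes the worst case in which a new execution group is formed before every trajectory task; the 778 ms and 60 μs figures are the group formation costs quoted in this review, and the helper names are hypothetical.

```python
NCCL_GROUP_MS = 778.0   # communicator construction cost quoted in the review
GROUP_FREE_MS = 0.060   # group-free formation cost, about 60 microseconds

def trajectory_task_count(denoise_steps: int) -> int:
    """One encode task, one task per denoising step, one decode task."""
    return 1 + denoise_steps + 1

def reconfig_overhead_ms(denoise_steps: int, per_reconfig_ms: float) -> float:
    """Worst case: a new group is formed before every trajectory task."""
    return trajectory_task_count(denoise_steps) * per_reconfig_ms

for steps in (1, 4, 8, 20, 50):
    tasks = trajectory_task_count(steps)
    print(steps, tasks,
          reconfig_overhead_ms(steps, NCCL_GROUP_MS),
          reconfig_overhead_ms(steps, GROUP_FREE_MS))
```

For a 20-step request, per-task reconfiguration would cost about 17 seconds with communicator construction versus about 1.3 ms with group-free formation, which is why the second mechanism is a precondition for the first.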
Appendix B: NVLink vs. PCIe — Why Communication Topology Matters
GF-DiT’s group-free collectives exploit pre-established point-to-point channels. The performance of these channels depends critically on the interconnect topology.
B.1 NVLink
NVLink is NVIDIA’s high-bandwidth GPU interconnect, available between GPUs within a single node on DGX/HGX systems:
| NVLink Generation | Links per GPU | Bandwidth per Link (bidirectional) | Total Bandwidth per GPU (bidirectional) |
|---|---|---|---|
| NVLink 3.0 (A100) | 12 | 50 GB/s | 600 GB/s |
| NVLink 4.0 (H100) | 18 | 50 GB/s | 900 GB/s |
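With NVLink, transferring a 1 GB latent tensor between 2 GPUs takes ~1.7 ms (A100) or ~1.1 ms (H100) if the full bidirectional total is available. A one-way transfer can use at most half of that figure, so roughly 3.3 ms and 2.2 ms are more realistic lower bounds. Either way, this is consistent with the 2–8 ms migration overhead GF-DiT reports: for large latents the cost is dominated by NVLink transfer time.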
B.2 PCIe
For GPUs on different nodes (or within a node without NVLink), the path goes through PCIe (~64 GB/s) and network fabric (RoCE/InfiniBand at 400 Gb/s = 50 GB/s). Cross-node transfer of a 1 GB latent takes ~20–40 ms, which can be non-negligible compared to denoising step times. GF-DiT’s artifact migration is thus most efficient within a single NVLink-connected server.
Implication: the 2–8 ms migration overhead reported in the paper is optimistic — it applies to NVLink-connected GPUs. For multi-node DiT serving (necessary for very large models), migration cost could be 5–20× higher. This is a gap in the paper’s evaluation (discussed in Section 8.1).
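The transfer times in this appendix follow from dividing tensor size by link bandwidth. The sketch below reproduces that arithmetic for a 1 GB latent; it ignores latency, protocol overhead, and contention, so the results are idealized lower bounds. The bandwidth figures are the ones quoted in this appendix, and `transfer_ms` is a hypothetical helper.

```python
def transfer_ms(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time in milliseconds, ignoring latency and protocol overhead."""
    return size_gb / bandwidth_gb_per_s * 1000.0

LINKS_GB_PER_S = {
    "NVLink 3.0 total (A100)": 600.0,
    "NVLink 4.0 total (H100)": 900.0,
    "PCIe (~64 GB/s)": 64.0,
    "400 Gb/s network": 400.0 / 8,  # convert bits to bytes
}

for name, bandwidth in LINKS_GB_PER_S.items():
    print(f"{name}: {transfer_ms(1.0, bandwidth):.1f} ms per GB")
```

The network link is the slowest hop on a cross-node path, so it sets the floor of about 20 ms per GB.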
Appendix C: Worked Example — Elastic SP Scheduling for a Mixed Request Batch
To make the scheduling algorithm concrete, here is a step-by-step trace of GF-DiT scheduling a batch with two requests on an 8-GPU system.
Setup:
- Request A: 720p 4-second video, T=30 denoising steps (compute-heavy)
- Request B: 512×512 image, T=20 denoising steps (compute-light)
- Policy: SLO-aware (Algorithm 3), with Request A deadline in 60s, Request B deadline in 5s
Initial state (t=0):
- Request A admitted, trajectory tasks A-enc, A-den[30..1], A-dec created
- Request B admitted, trajectory tasks B-enc, B-den[20..1], B-dec created
- All 8 GPUs idle
- Ready tasks: {A-enc, B-enc}
Scheduling decision t=0:
Policy evaluates A-enc and B-enc:
A-enc: ĉ(A-enc, 1) = 5ms; urgency = 60s - ε
B-enc: ĉ(B-enc, 1) = 2ms; urgency = 5s - ε
SLO policy: prioritize B-enc (tighter deadline)
→ B-enc: layout = {GPU 0, SP=1} (encoding is fastest at SP=1)
→ A-enc: layout = {GPU 1, SP=1} (same: encoding at SP=1)
→ GPUs 2–7 remain idle
After encoding completes (t≈7ms):
- Ready tasks: {A-den[30], B-den[20]}
- Both encodings completed; B has just under 5s left to its deadline, A just under 60s.
Scheduling decision t=7ms:
Policy evaluates:
A-den[30]: L_A = 4096 (video), ĉ(A-den, 4) = 80ms, ĉ(A-den, 8) = 50ms
time_budget = 60s / 30 steps = 2000ms per step → any SP sufficient
SLO policy: use minimum viable → SP=2 (slower than the 80ms SP=4 estimate, but any SP fits the 2000ms budget)
B-den[20]: L_B = 256 (image), ĉ(B-den, 1) = 15ms, ĉ(B-den, 2) = 18ms (communication overhead!)
time_budget = 5s / 20 steps = 250ms per step → SP=1 optimal
→ A-den[30]: layout = {GPU 2, GPU 3, SP=2}
→ B-den[20]: layout = {GPU 0, SP=1}
→ GPUs 1, 4-7 remain idle (available for new requests)
This trace illustrates the key policy trade-offs:
- Encoding always runs at SP=1 (small request shapes, communication-dominated)
- Video denoising uses SP=2 (not SP=8) because the SLO budget is generous — conserving GPUs for concurrency
- Image denoising uses SP=1 because SP=2 would be slower for the small sequence length
- 5 GPUs (GPU 1 and GPUs 4–7) remain idle and available for new arrivals
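In a static SP=4 system where both requests queue for the same 4-GPU group, Request B's encoding and denoising would wait until Request A releases its allocation: a wait of 80ms × 30 = 2.4 seconds for A's denoising alone. That wait consumes nearly half of B's 5-second budget before B has started, and B would then run at SP=4, which is slower than SP=1 for a 256-token image. Any further queued work could push B past its deadline. GF-DiT schedules B immediately and it completes in ≈7ms + 20×15ms = 307ms, well within its 5-second SLO.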
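The "minimum viable SP" rule used in the trace can be sketched in a few lines. This is a simplified stand-in for illustration, not the paper's Algorithm 3: the function names are hypothetical, the per-step budget is an even split of the remaining deadline, and the cost tables reuse the illustrative Figure 6 values rather than the estimates in the trace.

```python
def per_step_budget_ms(deadline_s: float, steps_left: int) -> float:
    """Evenly split the remaining time to deadline across the remaining steps."""
    return deadline_s * 1000.0 / steps_left

def min_viable_sp(cost_ms: dict, budget_ms: float, free_gpus: int) -> int:
    """Smallest SP degree that fits both the budget and the free GPUs.

    If no degree meets the budget, fall back to the fastest degree that fits
    in the free GPUs.
    """
    feasible = {sp: c for sp, c in cost_ms.items() if sp <= free_gpus}
    if not feasible:
        raise ValueError("no SP degree fits in the free GPUs")
    within = [sp for sp, c in feasible.items() if c <= budget_ms]
    if within:
        return min(within)
    return min(feasible, key=lambda sp: (feasible[sp], sp))

# Cost estimates in ms per SP degree (illustrative values from Figure 6).
video_cost = {1: 400, 2: 210, 4: 110, 8: 70}
image_cost = {1: 15, 2: 18, 4: 28, 8: 55}

print(min_viable_sp(video_cost, per_step_budget_ms(60, 30), free_gpus=8))
print(min_viable_sp(image_cost, per_step_budget_ms(5, 20), free_gpus=8))
print(min_viable_sp(video_cost, per_step_budget_ms(3, 30), free_gpus=8))
```

With these costs a pure minimum-viable rule picks SP=1 for the video under the 60-second deadline, whereas the trace uses SP=2, so the actual policy presumably weighs terms this sketch does not model. Tightening the deadline to 3 seconds pushes the same request to SP=8.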
Appendix D: Figure Summary — All Key Diagrams
Figure 1 (Section 2): GF-DiT system architecture — serving front-end → trajectory task graph generator → async runtime → policy interface → group-free collectives → GPU workers.
Figure 2 (Section 3.1): Trajectory task graph for a 4-step request — showing encoding, denoising steps, decoding, and artifact dependencies.
Figure 3 (Section 5.1): Elastic execution timeline on 8 GPUs — multiple requests executing simultaneously with different SP degrees per group.
Figure 4 (Section 6.2): Throughput comparison bar chart — GF-DiT reaches 6.01× over static SP4.
Figure 5 (Appendix C): Worked example of scheduling trace (described above in prose form).
Figure 6 (conceptual): Stage-level latency scaling vs. SP degree — encoding flat/increasing, small-image denoising degrades above SP=2, large-video denoising scales well to SP=4+.
graph LR
subgraph "Stage Latency vs. SP Degree"
direction LR
SP1["SP=1"] --> SP2["SP=2"] --> SP4["SP=4"] --> SP8["SP=8"]
end
subgraph "Encoding (L=short)"
E1["5ms"] --> E2["5ms"] --> E4["6ms"] --> E8["8ms"]
end
subgraph "Denoising (L=256, small image)"
D1["15ms"] --> D2["18ms"] --> D4["28ms"] --> D8["55ms"]
end
subgraph "Denoising (L=4096, large video)"
V1["400ms"] --> V2["210ms"] --> V4["110ms"] --> V8["70ms"]
end
Figure 6: Stage latency vs. SP degree for representative DiT workloads. Encoding is communication-bound and does not benefit from parallelism. Small-image denoising (short sequence) degrades beyond SP=2 due to overhead. Large-video denoising scales well to SP=4 and beyond. No single SP configuration is optimal for all three.
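The phrase "scales well" can be quantified as speedup and parallel efficiency relative to SP=1. The sketch below applies those standard definitions to the illustrative Figure 6 values; `scaling` is a hypothetical helper and the numbers are conceptual, not measured.

```python
def scaling(latency_ms: dict) -> dict:
    """Map SP degree -> (speedup over SP=1, parallel efficiency)."""
    base = latency_ms[1]
    return {sp: (base / t, base / t / sp) for sp, t in latency_ms.items()}

FIGURE_6_MS = {
    "encoding":      {1: 5,   2: 5,   4: 6,   8: 8},
    "small image":   {1: 15,  2: 18,  4: 28,  8: 55},
    "large video":   {1: 400, 2: 210, 4: 110, 8: 70},
}

for stage, table in FIGURE_6_MS.items():
    for sp, (speedup, efficiency) in scaling(table).items():
        print(f"{stage:12s} SP={sp}: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

An efficiency near 1 means the added GPUs are well used. For the large-video case it stays above 0.9 up to SP=4 and drops at SP=8, while the small-image case falls below 1× speedup as soon as a second GPU is added.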
Appendix E: Request Lifecycle State Machine
To make the runtime behavior precise, here is the state machine for a single DiT request in GF-DiT:
stateDiagram-v2
[*] --> Admitted : request arrives
Admitted --> TaskReady : trajectory graph built
TaskReady --> Executing : layout assigned by policy
Executing --> TaskReady : task completes, artifacts published
Executing --> ArtifactMigrating : SP degree changes
ArtifactMigrating --> Executing : migration done, new layout applied
TaskReady --> Completed : all tasks done
Completed --> [*] : response delivered
Figure 7: GF-DiT per-request lifecycle state machine. Each trajectory task cycles between TaskReady and Executing. When the policy changes SP degree between tasks, the runtime transparently enters ArtifactMigrating to reshard the latent tensor before dispatching the next task. The whole cycle is invisible to the application layer.
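The diagram translates directly into a transition table that a runtime could use to reject illegal state changes. The sketch below is a transcription of the diagram's edges only; the enum and function names are hypothetical and nothing here reflects GF-DiT's actual implementation.

```python
from enum import Enum, auto

class State(Enum):
    ADMITTED = auto()
    TASK_READY = auto()
    EXECUTING = auto()
    ARTIFACT_MIGRATING = auto()
    COMPLETED = auto()

# Allowed transitions, transcribed from the state diagram.
TRANSITIONS = {
    State.ADMITTED: {State.TASK_READY},
    State.TASK_READY: {State.EXECUTING, State.COMPLETED},
    State.EXECUTING: {State.TASK_READY, State.ARTIFACT_MIGRATING},
    State.ARTIFACT_MIGRATING: {State.EXECUTING},
    State.COMPLETED: set(),
}

def is_valid_path(path) -> bool:
    """True if the path starts at ADMITTED and every hop is an allowed transition."""
    if not path or path[0] is not State.ADMITTED:
        return False
    return all(b in TRANSITIONS[a] for a, b in zip(path, path[1:]))

lifecycle = [
    State.ADMITTED, State.TASK_READY, State.EXECUTING,
    State.ARTIFACT_MIGRATING, State.EXECUTING,
    State.TASK_READY, State.COMPLETED,
]
print(is_valid_path(lifecycle))
```

Keeping the table as data makes it easy to check logged request histories against the intended lifecycle.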
References
- Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV 2023.
- Ho, J. et al. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020.
- Song, J. et al. (2021). Denoising Diffusion Implicit Models. ICLR 2021.
- vLLM-Omni: vLLM extensions for omni-modal (diffusion + language) serving.
- SGLang Diffusion: SGLang serving framework with diffusion model support.
- Zheng, L. et al. (2023). Efficiently Programming Large Language Models using SGLang. arXiv 2312.07104.
- Qiang, X. et al. (2026). GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving. arXiv 2606.13501.
- Zheng, L. et al. (2022). Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. OSDI 2022.
- Sun, B. et al. (2024). Llumnix: Dynamic Scheduling for Large Language Model Serving. OSDI 2024.
- Zhong, Y. et al. (2024). DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving. OSDI 2024.
- Wan: Wan 2.1 T2V model, Wan-Video team.
- HunyuanVideo: Tencent HunyuanVideo T2V generation system.