Tangram: Hiding GPU Heterogeneity for Efficient LLM Parallelization

Review date: 2026-07-02 Review author: Zhongzhu Zhou Paper reviewed: Tangram: Hiding GPU Heterogeneity for Efficient LLM Parallelization Paper authors: Yanda Tao, Ana Klimovic arXiv: 2606.16907 Status/Venue: arXiv preprint, June 2026 (ETH Zürich, cs.DC)

Short Answer

Every organization that trains large language models faces the same hardware reality: GPU clusters are not monolithic. A cluster bought two years ago has H100 nodes; today’s procurement adds Blackwell nodes; somewhere in the rack there are older A100 nodes still running because they paid for themselves long ago. Designing a training job that efficiently uses all of these simultaneously is an unsolved problem — existing auto-parallelizers are either homogeneous (they pretend all GPUs are the same and produce poor plans) or heterogeneous (they know about GPU differences but must sacrifice feature coverage — no expert parallelism, no ZeRO, no context parallelism — to keep the search tractable).

Tangram’s thesis is that these two goals — supporting a full feature set and handling heterogeneity — do not have to trade off. The key insight is that parallelization planning and heterogeneity handling can be decoupled: a homogeneous parallelizer is given a “GPU island” (a view of only the similar GPUs in a cluster), plans a slice of the model for that island, and Tangram’s outer layer composes the resulting partial plans into a globally-balanced pipeline across islands. The result is a system that plugs into existing parallelizers through a narrow API, inherits all their features, and achieves up to 2.3× higher training throughput than today’s heterogeneous baselines.

Prerequisites: What You Need to Know First

Before diving into Tangram’s design, it helps to have a working mental model of the four problems it sits at the intersection of.

1. LLM Training Iteration: Memory and Compute Breakdown

Each training step processes a global batch of tokens. The forward pass runs from layer 1 to layer LL, producing intermediate tensors at each layer. The backward pass recomputes gradients from layer LL to layer 1 using the chain rule. An optimizer (usually Adam) then applies parameter updates.

The GPU memory footprint per device has four components:

Mdevice  =  Mparams+Moptimizer+Mactivations+Mgrad(1)M_{\text{device}} \;=\; M_{\text{params}} + M_{\text{optimizer}} + M_{\text{activations}} + M_{\text{grad}} \tag{1}

Let us size each term for a 70B dense LLM in BF16 with Adam:

  • Parameters (MparamsM_{\text{params}}): 70×109×2bytes=140GB70 \times 10^9 \times 2\,\text{bytes} = 140\,\text{GB} in BF16.
  • Adam optimizer states (MoptimizerM_{\text{optimizer}}): Adam stores first moment mm and second moment vv in FP32, so 70×109×2×4bytes=560GB70 \times 10^9 \times 2 \times 4\,\text{bytes} = 560\,\text{GB}.
  • Gradients (MgradM_{\text{grad}}): same shape as parameters, stored in FP32 or BF16 depending on mixed-precision recipe. FP32: 280 GB.
  • Activations (MactivationsM_{\text{activations}}): depend on batch size, sequence length, and whether activation recomputation is used. For a batch of 8 with sequence length 4096, activations can reach 80–200 GB depending on recomputation policy.

Total naive requirement: roughly 980–1180 GB — an order of magnitude larger than a single H100’s 80 GB HBM. We have no choice but to distribute across GPUs. The question is how to distribute, which is precisely what a parallelization plan specifies.

2. Multi-Dimensional Parallelism (MDP): The Design Space

Modern LLM training systems combine five orthogonal parallelism strategies, each targeting a different dimension of the memory/compute problem:

StrategyAbbrev.PartitionsCommunication PatternMain Cost
Data parallelismDPInput batch across replicasAllReduce of gradients (end of step)Gradient sync bandwidth
Tensor parallelismTPWeight matrices column/row-wiseAllReduce per layerPer-layer bandwidth
Pipeline parallelismPPLayers into sequential stagesPoint-to-point activationsPipeline bubble
Context parallelismCPInput sequence (position dim)Ring AllReduce for attentionAttention cross-sequence comm
Expert parallelismEPMoE expert assignments to GPUsAlltoAll for token routingExpert load imbalance

These strategies interact in non-trivial ways. For example:

  • TP requires high-bandwidth intra-node links (NVLink) because it communicates every layer; it is rarely used across slow inter-node links.
  • PP is communication-lightweight between stages (only activation tensors cross stage boundaries) but introduces pipeline bubbles during fill and drain.
  • DP gradient synchronization can be overlapped with backward computation using asynchronous AllReduce.
  • ZeRO (Rajbhandari et al., 2020) partitions optimizer states, gradients, and parameters across DP replicas, reducing per-GPU memory by up to 1/NDP1/N_{\text{DP}} at the cost of extra collective communication during the parameter update step.

A parallelization plan assigns specific parallelism degrees to all of these strategies (e.g., TP=4, PP=2, DP=8, EP=4, ZeRO=2) and specifies which physical GPU handles which model partition (the placement). The plan space is the Cartesian product of all valid degree combinations times all valid placement mappings — an enormous combinatorial space even for simple homogeneous clusters.

3. Memory-Saving Techniques: The Full Toolkit

On top of the parallelism strategies, training systems apply memory-saving techniques that trade computation or bandwidth for reduced peak memory usage:

Gradient accumulation (GA): Split the global batch of BB tokens into nGAn_{\text{GA}} micro-batches of B/nGAB / n_{\text{GA}} tokens each. Process them sequentially, accumulate gradients, then perform one optimizer step per global batch. This reduces peak activation memory from O(B)O(B) to O(B/nGA)O(B / n_{\text{GA}}) without changing effective batch size.

Activation recomputation (RC): During the forward pass, discard intermediate activations instead of storing them. During the backward pass, recompute them on-the-fly. Full RC eliminates MactivationsM_{\text{activations}} at the cost of one extra forward pass (~33% compute overhead). Selective RC only recomputes the most memory-intensive activations (attention maps), reducing overhead to ~5–10%.

Activation offloading (OF): Move activations to CPU DRAM during the forward pass, prefetch them back before the backward pass uses them. Works well when the CPU–GPU PCIe bandwidth is not the bottleneck and CPU DRAM is large.

ZeRO / FSDP: Shard optimizer states across NZeRON_{\text{ZeRO}} data-parallel GPUs. ZeRO Stage 1 shards optimizer states only; Stage 2 adds gradient sharding; Stage 3 (= FSDP) further shards parameters. The per-GPU memory reduction for optimizer states is 1/NZeRO1/N_{\text{ZeRO}}.

Together, these techniques interact with parallelism choices in complex ways that make manual configuration of the optimal plan infeasible for large clusters — motivating automatic parallelizers.

4. Automatic LLM Parallelizers: The State of the Art

Given a model architecture (number of layers, hidden size, attention heads, MoE expert count) and a cluster specification (number of GPUs, per-GPU compute, memory, interconnect topology), automatic parallelizers search for the (parallelism plan, placement) pair that maximizes training throughput. The best-known approaches are:

  • Alpa (Zheng et al., 2022): Decomposes the problem hierarchically. First finds the optimal intra-operator parallelism (TP + DP) plan using integer linear programming (ILP), then finds the inter-operator (PP) plan using dynamic programming. Backend: XLA (PJRT).
  • Galvatron (Miao et al., 2022): Decision-tree search over DP/TP/PP degrees for each layer, with MoE expert parallelism support. Backend: Megatron-LM (PyTorch).
  • Aceso (Liu et al., 2024): Iterative bottleneck-alleviation beam search. Starts with a feasible plan and iteratively relaxes activation recomputation decisions at bottleneck operators. Key advantage: per-operator RC decisions, enabling much more efficient memory-throughput trade-offs. Backend: Megatron.
  • Mist (Zhu et al., 2025): MILP + enumeration. Supports activation offloading and ZeRO simultaneously.
  • HyperTron (Li et al., 2025): Simulated annealing search with CP and EP support.

All of these systems assume a homogeneous cluster: all GPUs have the same compute, memory, and interconnect characteristics. This makes the search space symmetric — any rotation of a plan across identical GPUs produces the same throughput, allowing massive pruning. For a 512-GPU homogeneous cluster, Aceso’s search takes ~30 minutes on a modern CPU.

5. The Heterogeneity Problem: Why It Breaks Everything

When GPU clusters mix different generations (A100 + H100 + H200) or even different models within a generation (H100 SXM 80 GB vs. H100 NVL 94 GB), the symmetry assumption breaks completely.

Dimension 1: Asymmetric partitioning. In a homogeneous cluster under PP, all pipeline stages can have the same number of layers because all GPUs have the same compute. In a heterogeneous cluster, more compute-rich GPUs should handle more layers to keep stages balanced. But “how many more layers?” depends on the exact compute ratio, memory ratio, and the parallelism degrees chosen within each stage — all of which are jointly determined.

Dimension 2: Placement sensitivity. Under TP, GPUs must communicate via AllReduce every layer. If TP is applied across GPUs with different interconnect bandwidth (e.g., A100 on InfiniBand vs. H100 on NVLink), the AllReduce latency becomes unpredictable and the communication bottleneck degrades throughput significantly.

Dimension 3: Search space explosion. The number of valid MDP plans for a heterogeneous cluster is roughly:

Phet  =  Phomo  ×  (GG1,G2,,GK)(2)|\mathcal{P}_{\text{het}}| \;=\; |\mathcal{P}_{\text{homo}}| \;\times\; \binom{G}{G_1, G_2, \ldots, G_K} \tag{2}

where GG is the total number of GPUs, GiG_i is the number of GPUs of type ii, and the multinomial coefficient counts the possible placements. For a 512-GPU cluster with 3 GPU types, this factor alone exceeds 101010^{10}, making the joint search intractable.

Existing heterogeneous parallelizers (Metis, Sailor) respond by simply dropping features: no CP, no EP, no ZeRO, no per-operator RC. This reduces the search space by several orders of magnitude but produces plans that are often worse than a naïve homogeneous approach (as Aceso beats Metis on heterogeneous clusters in the paper’s experiments). This is the fundamental tension Tangram resolves.

Tangram’s Architecture: The Decoupling Approach

flowchart TD
    A[Heterogeneous GPU Cluster\nH100 SXM + A100 + H200 NVLink] --> B[Step 1: GPU Island Formation\nCluster by performance characteristics]
    B --> C1[Island 0: 64× H100 SXM\n989 TFLOP/s, 80 GB NVLink]
    B --> C2[Island 1: 128× A100 80GB\n312 TFLOP/s, 80 GB IB]
    B --> C3[Island 2: 32× H200\n1979 TFLOP/s, 141 GB NVLink]
    C1 --> D[Step 2-3: Model Slice Enumeration\nwith 4-Rule Pruning]
    C2 --> D
    C3 --> D
    D --> E1[Partial Plan P0: Layers 0-19 on Island 0\nTP=4, DP=16, ZeRO=2, RC=selective]
    D --> E2[Partial Plan P1: Layers 20-59 on Island 1\nTP=8, DP=16, ZeRO=3, RC=full]
    D --> E3[Partial Plan P2: Layers 60-79 on Island 2\nTP=4, EP=8, ZeRO=2, RC=selective]
    E1 --> F[Step 4: Pipeline Composition\nDP Algorithm - Maximize Throughput]
    E2 --> F
    E3 --> F
    F --> G[Global Plan: 3-Stage Work-Balanced Pipeline\nBubble ratio < 5%]
    G --> H[Training Runtime: Megatron / DeepSpeed]

Figure 1: Tangram end-to-end architecture. A heterogeneous cluster is partitioned into GPU islands (Step 1). Model slices are enumerated and matched to islands (Steps 2–3) with aggressive pruning. Existing homogeneous parallelizers plan each (slice, island) pair. Dynamic programming composes the partial plans into a globally balanced pipeline (Step 4).

Component 1: GPU Island Formation

The first challenge is deciding which GPUs are “similar enough” to be treated as homogeneous by the inner parallelizer. Tangram defines a GPU island as a set of GPU nodes whose intra-node and inter-node performance metrics fall within a tolerable range.

Performance Metric Taxonomy

Tangram characterizes each GPU node along two axes:

Intra-node metrics (per-node scalars):

  • Peak BF16 throughput CgC_g (TFLOP/s)
  • HBM capacity MgM_g (GB)
  • HBM bandwidth Bmem,gB_{\text{mem},g} (GB/s)
  • NVLink aggregate bandwidth within the node BNVL,gB_{\text{NVL},g} (GB/s)
  • Number of GPUs per node ngn_g

Inter-node metrics (per-link scalars):

  • InfiniBand / RoCE / Ethernet bandwidth Bnet,g,gB_{\text{net},g,g'} (Gb/s)
  • Network topology (fat-tree rail count, oversubscription ratio)

Island Formation Algorithm

Algorithm 1: GPU Island Formation
----------------------------------
Input:
  G = {g_1, ..., g_N}: all GPU nodes
  metrics: C[g], M[g], B_mem[g], B_NVL[g], B_net[g,g']
  thresholds: τ_C, τ_M, τ_B (relative tolerance, e.g. 0.2 = 20%)
  merge_threshold: t_merge (max allowable bottleneck increase)

Output:
  I = {I_1, ..., I_K}: set of GPU islands

Phase 1 — Initial clustering:
  Build adjacency graph where edge (g_i, g_j) exists iff:
    |C[g_i] - C[g_j]| / max_g C[g]  ≤  τ_C
    AND |M[g_i] - M[g_j]| / max_g M[g]  ≤  τ_M
    AND |B_mem[g_i] - B_mem[g_j]| / max_g B_mem[g]  ≤  τ_B
  Initial islands = connected components of this graph

Phase 2 — Greedy merging:
  priority_queue PQ: pairs (I_a, I_b) sorted by merge_cost(I_a, I_b)
  while PQ is not empty:
    (I_a, I_b) = PQ.pop()
    bottleneck = estimate_pipeline_bottleneck(I_a ∪ I_b)
    if bottleneck ≤ t_merge * current_bottleneck:
      Merge I_a and I_b → I_new
      effective_mem[I_new] = min(M[g] for g in I_new)
      Update PQ with new merge candidates involving I_new

Return final island set I

The estimate_pipeline_bottleneck function uses the effective HBM capacity and bandwidth of the merged island to estimate whether the merged island could sustain the same per-stage throughput as the constituent islands.

Why conservative HBM estimation? When an island contains GPUs with different HBM capacities (e.g., H100 SXM 80 GB and H100 NVL 94 GB), the parallelizer plans assuming the minimum capacity (80 GB). This ensures no GPU runs out of memory during training, at the cost of wasting the extra HBM on the NVL nodes. The alternative — planning for the average capacity — risks OOM errors at runtime on the smaller nodes.

Worked Example: Three-Island Cluster

Consider a cluster with 64 H100 SXM nodes (80 GB, ~989 TFLOP/s), 128 A100 nodes (80 GB, ~312 TFLOP/s), and 32 H200 nodes (141 GB, ~1979 TFLOP/s). With tolerance τC=0.2\tau_C = 0.2:

  • H100 SXM and A100: 989312/1979=0.34>0.2|989 - 312| / 1979 = 0.34 > 0.2 → separate islands.
  • H100 SXM and H200: 9891979/1979=0.50>0.2|989 - 1979| / 1979 = 0.50 > 0.2 → separate islands.
  • H100 SXM 80 GB and H100 NVL 94 GB: 989989/989=0|989 - 989| / 989 = 0 → same island.

Result: three islands. If the cluster also had H100 NVL nodes, they would merge into the H100 SXM island, keeping the island count at 3 instead of 4.

graph LR
    subgraph Island 0 H100 SXM
    G1[Node A: H100 SXM 80GB\n989 TFLOP/s]
    G2[Node B: H100 NVL 94GB\n989 TFLOP/s]
    end
    subgraph Island 1 A100
    G3[Node C: A100 80GB\n312 TFLOP/s]
    end
    subgraph Island 2 H200
    G4[Node D: H200 141GB\n1979 TFLOP/s]
    end
    G1 -- "15% HBM diff → merge\nEffective HBM = 80GB" --> G2
    Island0 -- "IB inter-node\nPP boundary" --> Island1
    Island1 -- "IB inter-node\nPP boundary" --> Island2

Figure 2: GPU island formation example. H100 SXM and H100 NVL are merged (15% HBM difference, same compute). Three distinct GPU generations form three separate islands. Arrows show inter-island pipeline boundaries.

Component 2: Model-Slice Interface

A model slice Sa:bS_{a:b} encapsulates transformer layers aa through b1b-1 of the full LL-layer model. It exposes the same interface as a full model:

  • Input: activation tensor AinRB×T×d\mathbf{A}_{\text{in}} \in \mathbb{R}^{B \times T \times d} (batch × sequence × hidden)
  • Output: activation tensor AoutRB×T×d\mathbf{A}_{\text{out}} \in \mathbb{R}^{B \times T \times d}
  • Parameters: weight matrices for layers aa to b1b-1

This interface works because all Transformer decoder layers share the same input/output shape. The only exception is the embedding layer (layer 0), which maps integer token IDs to dd-dimensional vectors, and the final projection layer (LL), which maps from dd to vocabulary size. Tangram handles these as fixed pipeline endpoints.

The Generic Parallelizer API

# Tangram's interface to any homogeneous LLM parallelizer
def invoke_parallelizer(
    parallelizer: HomogeneousLLMParallelizer,
    model_slice: ModelSlice,      # layers a to b-1
    gpu_island:  GPUIsland,       # set of homogeneous GPUs
) -> PartialPlan:
    """
    Returns the best parallelization plan for this (slice, island) pair.
    The parallelizer sees only a homogeneous island and a sub-model;
    it can use TP, DP, PP (within-island), EP, ZeRO, RC, OF, GA freely.
    """
    return parallelizer.plan(model_spec=model_slice, device_spec=gpu_island)

@dataclass
class PartialPlan:
    parallelism: dict  # {'TP': 4, 'DP': 16, 'EP': 8, 'ZeRO': 2, 'RC': 'selective'}
    throughput_estimate: float  # tokens/sec for this stage alone
    memory_per_gpu: float       # GB, peak during training
    activation_output_shape: tuple  # (B/DP, T/CP, d) — for interface compatibility
    latency_per_microbatch: float   # seconds — used in DP composition

This narrow API is what makes Tangram’s “heterogeneity hiding” practical: the inner parallelizer never needs to be modified. Tangram simply calls invoke_parallelizer() for each pruned (slice, island) pair and collects the resulting partial plans.

Component 3: Pruning the Exploration Space

Without pruning, the number of (model-slice, GPU-island) pairs is O((LK1)K!)O\bigl(\binom{L}{K-1} \cdot K!\bigr), which grows rapidly. For L=80L = 80 layers and K=3K = 3 islands, this is (792)×6=18,564\binom{79}{2} \times 6 = 18,564 pairs before considering that not all slice-to-island assignments are valid. Each pair requires a potentially expensive call to the homogeneous parallelizer (~seconds per call with Aceso’s beam search). Pruning is essential.

Rule 1: Redundant Slice Deduplication

All transformer layers in a decoder-only LLM have identical structure (same attention and MLP architecture, same hidden dimensions). Therefore, for any given GPU island IkI_k:

PartialPlan(Sa:b,Ik)  =  PartialPlan(Sa:b,Ik)ifba=ba(3)\text{PartialPlan}(S_{a:b},\, I_k) \;=\; \text{PartialPlan}(S_{a':b'},\, I_k) \quad \text{if}\quad b-a = b'-a' \tag{3}

The optimal plan for “any 20 consecutive layers” on island IkI_k is the same regardless of which 20 layers we pick (because all layers are structurally identical). This reduces the dimension over layer slices from O(L2/2)O(L^2/2) to O(L)O(L) distinct layer-count values.

Impact: This single rule typically reduces the number of parallelizer invocations by O(L/K)O(L / K). For L=80,K=3L=80, K=3: from ~18,000 pairs to ~240 pairs.

Rule 2: Unbalanced Slice Pruning

Pipeline throughput is determined by the slowest stage. If island IkI_k has compute capacity Ck=gIkCgC_k = \sum_{g \in I_k} C_g, then the balanced assignment gives it a share of the model proportional to its compute:

Sk  =  LCkjCj(4)|S_k|^* \;=\; L \cdot \frac{C_k}{\sum_j C_j} \tag{4}

If the proposed slice has Sa:b=ba|S_{a:b}| = b - a layers such that

Sa:bSk>βorSa:bSk<1β(5)\frac{|S_{a:b}|}{|S_k|^*} > \beta \quad \text{or} \quad \frac{|S_{a:b}|}{|S_k|^*} < \frac{1}{\beta} \tag{5}

for some imbalance tolerance β\beta (paper uses β=2\beta = 2), then this assignment will cause a severe pipeline bubble: the overloaded island finishes β×\beta\times later than average, causing all other islands to idle. Such pairs are discarded without calling the parallelizer.

Why not use β=1\beta = 1 for strict balance? Strict balance would be ideal but might prune valid plans: if a smaller island benefits from ZeRO-3 with full activation recomputation (saving memory), it might be able to handle more layers than compute alone suggests. β=2\beta = 2 is a loose enough tolerance to keep such plans while still eliminating obviously pathological assignments.

Rule 3: Memory Infeasibility Check

A (slice, island) pair is infeasible if the model slice cannot fit on the island’s GPUs even under maximum memory savings:

(ba)MlayerIkMk(1ϵsafety)>1Zmax(6)\frac{(b-a) \cdot M_{\text{layer}}}{|I_k| \cdot M_k \cdot (1 - \epsilon_{\text{safety}})} > \frac{1}{Z_{\text{max}}} \tag{6}

Here MlayerM_{\text{layer}} is the parameter memory per layer (in BF16, for a 70B model with hidden size d=8192d=8192: roughly 81922×1270×10911.4GB/layer\frac{8192^2 \times 12}{70 \times 10^9} \approx 11.4\,\text{GB/layer}), MkM_k is the per-GPU HBM of island kk, ϵsafety=0.1\epsilon_{\text{safety}} = 0.1 reserves 10% for activations and scratch buffers, and ZmaxZ_{\text{max}} is the maximum ZeRO degree (3 for ZeRO-3/FSDP). Pairs that violate this constraint are discarded immediately, before invoking the expensive parallelizer.

Rule 4: Interface Compatibility Check

For two adjacent partial plans pl:l,kp_{l':l,k} and pl:l,k+1p_{l:l'',k+1} to compose into a valid pipeline stage boundary, their interface must be compatible. The most common incompatibility is a TP degree mismatch:

  • If plan pkp_k uses TP=4, its output activation is sharded across 4 GPUs per TP group.
  • If plan pk+1p_{k+1} uses TP=8, it expects the input activation on 8 GPUs.
  • The transition requires an all-gather on the TP=4 output + a re-scatter on the TP=8 input.

Tangram allows such interface transitions by inserting an explicit all-gather/scatter at the boundary, but charges the cost in the throughput estimate. Pairs where the interface cost exceeds a threshold are discarded.

flowchart LR
    A["All combinations\nO(L^K) pairs"] --> B["Rule 1: Deduplicate\nidentical layer counts\n÷ L/K factor"]
    B --> C["Rule 2: Balance check\nDiscard ±β imbalance\n÷ large fraction"]
    C --> D["Rule 3: Memory check\nInfeasible → discard\n÷ moderate factor"]
    D --> E["Rule 4: Interface check\nHigh transition cost → discard"]
    E --> F["Pruned set\n~1/26 of original\nParallelizer invocations"]

Figure 3: Four-rule pruning cascade. Each rule progressively reduces the search space. Together they achieve a 26× reduction in parallelizer invocations with zero loss in plan quality (proven by the paper’s ablation: all four rules applied = 1× planning time, same optimal plan).

Component 4: Pipeline Composition via Dynamic Programming

After pruning and parallelizer invocations, we have a set of partial plans {pa:b,k}\{p_{a:b,k}\}. The composition problem is: choose a set of non-overlapping plans {pa0:a1,0,pa1:a2,1,,paK1:L,K1}\{p_{a_0:a_1,0},\, p_{a_1:a_2,1},\, \ldots,\, p_{a_{K-1}:L,K-1}\} whose slices cover the full model, maximizing overall training throughput.

Pipeline Throughput Model

With nstagesn_{\text{stages}} pipeline stages and nmicron_{\text{micro}} micro-batches per global batch, the pipeline cycle time is:

Tcycle=(nmicro1)τmax+k=0nstages1τk(7)T_{\text{cycle}} = (n_{\text{micro}} - 1) \cdot \tau_{\max} + \sum_{k=0}^{n_{\text{stages}}-1} \tau_k \tag{7}

where τk\tau_k is the per-micro-batch compute time of stage kk and τmax=maxkτk\tau_{\max} = \max_k \tau_k. The first term models the pipeline’s steady state (every τmax\tau_{\max} seconds, one micro-batch completes). The second term is the pipeline fill and drain overhead.

Training throughput is:

Throughput=nmicroBglobal/nmicroTcycle=BglobalTcycle(8)\text{Throughput} = \frac{n_{\text{micro}} \cdot B_{\text{global}} / n_{\text{micro}}}{T_{\text{cycle}}} = \frac{B_{\text{global}}}{T_{\text{cycle}}} \tag{8}

where BglobalB_{\text{global}} is the total number of tokens in a global batch.

Bubble ratio analysis. The wasted fraction of compute due to pipeline imbalance is:

BubbleRatio=(nstages1)τmaxnmicroτmax+kτknstages1nmicro(9)\text{BubbleRatio} = \frac{(n_{\text{stages}} - 1) \cdot \tau_{\max}}{n_{\text{micro}} \cdot \tau_{\max} + \sum_k \tau_k} \approx \frac{n_{\text{stages}} - 1}{n_{\text{micro}}} \tag{9}

(the approximation holds when stages are roughly balanced). This formula reveals the classic trade-off in pipeline parallelism: more micro-batches (nmicron_{\text{micro}}) dilute the bubble ratio, but also increase the memory pressure from live activations in the pipeline. The optimal nmicron_{\text{micro}} balances these two forces.

For Tangram’s heterogeneous pipeline specifically, if one stage is rr times slower than average:

τmax=rτˉ,τˉ=kτknstages(10)\tau_{\max} = r \cdot \bar{\tau}, \quad \bar{\tau} = \frac{\sum_k \tau_k}{n_{\text{stages}}} \tag{10}

then the bubble ratio becomes approximately (nstages1)r/nmicro(n_{\text{stages}} - 1) \cdot r / n_{\text{micro}}. Tangram’s DP algorithm minimizes rr (the imbalance ratio) by choosing slice sizes proportional to each island’s compute capacity.

Dynamic Programming Formulation

Let dp[l][k]\text{dp}[l][k] = maximum throughput achievable by partitioning layers 0l0 \ldots l across the first kk GPU islands.

Base case:

dp[0][0]=+(no layers assigned, no time spent)(11)\text{dp}[0][0] = +\infty \quad \text{(no layers assigned, no time spent)} \tag{11}

Recurrence:

dp[l][k]=max0l<l,  pl:l,k exists  UpdateThroughput(dp[l][k1],  τ(pl:l,k))(12)\text{dp}[l][k] = \max_{0 \leq l' < l,\; p_{l':l,k} \text{ exists}} \; \text{UpdateThroughput}\bigl(\text{dp}[l'][k-1],\; \tau(p_{l':l,k})\bigr) \tag{12}

where UpdateThroughput(prev_tp, new_stage_latency) computes the new pipeline throughput given the addition of a new stage with latency τ\tau:

UpdateThroughput(Tprev,τnew)=Bglobal(nmicro1)max(τmaxprev,τnew)+τsumprev+τnew(13)\text{UpdateThroughput}(T_{\text{prev}}, \tau_{\text{new}}) = \frac{B_{\text{global}}}{(n_{\text{micro}} - 1) \cdot \max(\tau_{\max}^{\text{prev}}, \tau_{\text{new}}) + \tau_{\text{sum}}^{\text{prev}} + \tau_{\text{new}}} \tag{13}

Full DP algorithm:

Algorithm 2: Pipeline Composition DP
--------------------------------------
Input:  partial_plans[a:b][k]  for each valid (slice, island) pair
        L = total layers, K = number of islands
        n_micro = micro-batch count per global batch

Initialize:
  dp[l][k] = -infinity  for all (l, k)
  dp[0][0] = +infinity
  tau_max[0][0] = 0
  tau_sum[0][0] = 0
  parent[l][k] = None

Main DP:
  For k = 1 to K:
    For l = 1 to L:
      For l' = 0 to l-1:
        if partial_plans[l':l][k] exists and is feasible:
          tau_new = partial_plans[l':l][k].latency_per_microbatch
          new_tau_max = max(tau_max[l'][k-1], tau_new)
          new_tau_sum = tau_sum[l'][k-1] + tau_new
          # Compute new throughput via Equation (13)
          new_tp = B_global / ((n_micro - 1) * new_tau_max + new_tau_sum)
          if new_tp > dp[l][k]:
            dp[l][k] = new_tp
            tau_max[l][k] = new_tau_max
            tau_sum[l][k] = new_tau_sum
            parent[l][k] = (l', k-1)

Backtrack from parent[L][K] to recover the optimal slice assignment
Return dp[L][K] as the maximum throughput

Complexity: O(L2K)O(L^2 K) time, O(LK)O(LK) space. For L=80,K=3L = 80, K = 3: 802×3=19,20080^2 \times 3 = 19,200 DP cells — trivially fast (< 1 second on a CPU).

Why pipeline parallelism for inter-island composition?

This design choice deserves a detailed justification. The alternative would be to allow tensor parallelism to span across island boundaries (cross-island TP). Here is why Tangram rejects it:

  • Cross-island TP requires AllReduce after every layer. For a 70B model with d=8192d = 8192, each AllReduce transfers 2×8192×TP_degree×batch_size2 \times 8192 \times \text{TP\_degree} \times \text{batch\_size} bytes. At 128 Gb/s InfiniBand (inter-island), a single AllReduce for batch size 4096, TP=8 takes about 2×8192×8×4096×2/128e967ms2 \times 8192 \times 8 \times 4096 \times 2 / 128e9 \approx 67\,\text{ms}. There are 80 layers, so this adds 80×67=5.4seconds80 \times 67 = 5.4\,\text{seconds} of pure communication per step — dwarfing the compute time.
  • Cross-island PP needs only one activation transfer per stage boundary, transferring the activation tensor of shape [B/DP,T/CP,d][B/DP, T/CP, d]. For B/DP=1,T/CP=4096,d=8192B/DP = 1, T/CP = 4096, d = 8192 in BF16: 4096×8192×2=67MB4096 \times 8192 \times 2 = 67\,\text{MB} per micro-batch. At 128 Gb/s: 67MB×8/128e94.2ms67\text{MB} \times 8 / 128\text{e9} \approx 4.2\,\text{ms} per micro-batch. This is 3 orders of magnitude less overhead than cross-island TP.

The asymmetry is stark: TP communicates O(d)O(d) per-layer, PP communicates O(BTd)O(B \cdot T \cdot d) per-stage (once, not per-layer). For practical training configurations where num_layers1\text{num\_layers} \gg 1, PP wins decisively for inter-island communication.

System Integration: Plugging into Existing Parallelizers

Tangram’s integration architecture is shown in Figure 4:

sequenceDiagram
    participant User as User / Training Script
    participant Tangram as Tangram Controller
    participant P as Homogeneous Parallelizer\n(e.g. Aceso / Galvatron)
    participant RT as Training Runtime\n(Megatron / DeepSpeed)

    User->>Tangram: plan(model=LLaMA-70B, cluster=het_spec)
    Tangram->>Tangram: form_islands() → [I_H100, I_A100]
    loop For each pruned (slice, island) pair
        Tangram->>P: plan(model_slice=S[a:b], gpu_island=I_k)
        P-->>Tangram: PartialPlan{TP, DP, EP, ZeRO, RC, τ(p)}
    end
    Tangram->>Tangram: compose_dp(partial_plans) → GlobalPlan
    Tangram-->>User: GlobalPlan (stage assignments + configs)
    User->>RT: initialize(model, GlobalPlan)
    RT->>RT: Allocate parameters per GPU per stage
    RT->>RT: Begin training loop
    Note over RT: Pipeline executes; each island\nruns its own parallelism config internally

Figure 4: Tangram-to-parallelizer-to-runtime interaction sequence. The key feature is that the inner parallelizer (Aceso or Galvatron) is called multiple times, once per (slice, island) pair, and sees a fully homogeneous problem each time. Tangram stitches the results together at the composition layer.

MoE Support via LAER-MoE

For Mixture-of-Experts models, Tangram integrates with Galvatron’s LAER-MoE runtime. When a model slice contains MoE layers, the parallelizer can use expert parallelism (EP) within the island. The EP AlltoAll routing communication (routing tokens to their assigned experts) occurs entirely within the island, where NVLink or high-bandwidth IB is available.

Cross-island communication at PP boundaries carries only the aggregated output activations — not the per-expert activations. This means the inter-island bandwidth requirement is the same for MoE and dense models at the same hidden dimension.

Evaluation: Key Results

Setup

  • Models: LLaMA-style 70B dense model; Mixtral-8x22B MoE; Qwen3-MoE.
  • Clusters: Real testbeds with A100 (80 GB) + H100 SXM (80 GB) nodes. Simulated 512-GPU clusters for scale experiments.
  • Baselines:
    • Homogeneous: Alpa (2022), Aceso (2024)
    • Heterogeneous: Metis (2024), Sailor (2025)
  • Metric: Training throughput (tokens/second), planning time (minutes)

Main Throughput Results: Heterogeneous Cluster

xychart-beta
    title "Relative Training Throughput — 70B Model, 512-GPU Heterogeneous Cluster (A100 + H100)"
    x-axis ["Alpa", "Aceso", "Metis", "Sailor", "Tangram"]
    y-axis "Throughput (relative to Aceso = 1.0)" 0 --> 2.5
    bar [0.95, 1.0, 0.43, 0.52, 2.3]

Figure 5: Relative training throughput on a 512-GPU heterogeneous cluster. Tangram achieves 2.3× the throughput of Metis and 2.1× of Aceso. Notably, Aceso (which ignores heterogeneity) outperforms Metis (which tries to handle it), because Aceso’s per-operator activation recomputation support more than compensates for its suboptimal placement.

Why Does Aceso Beat Metis?

This result is the paper’s most important empirical insight. Metis handles heterogeneity but drops per-operator activation recomputation (RC) to keep its search tractable. Aceso ignores heterogeneity but supports RC. In practice, RC’s benefit (avoiding memory pressure = allowing larger batch sizes = higher throughput) outweighs the benefit of optimal placement.

This implies that feature completeness is more valuable than heterogeneity-awareness. Tangram is the first system to achieve both simultaneously.

Homogeneous Cluster Results

On a fully homogeneous cluster, Tangram’s performance equals Aceso (the best homogeneous baseline) and exceeds Metis/Sailor by up to 3.1×. This occurs because Metis and Sailor have fewer supported features even on homogeneous clusters — a structural weakness unrelated to heterogeneity.

Pruning Ablation (paper §6.5)

Pruning rules appliedPlanning time (relative)Throughput
None26×same (optimal)
Rule 1 only (dedup)same
Rules 1+2 (balance)same
Rules 1+2+3 (memory)1.5×same
All four rules1× (baseline)same

All four rules together achieve a 26× speedup with no loss in plan quality, confirming that the pruned plans are a strict subset of provably suboptimal plans.

MoE Model Results

On Mixtral-8x22B with an A100+H100 cluster:

  • Metis: cannot use EP → replicates all experts (memory inefficient) → low throughput
  • Tangram with EP within islands: 1.8× over Metis

The speedup comes from two sources: (1) EP reduces memory per GPU by distributing experts across GPUs within each island, enabling larger effective batch sizes; (2) optimal placement by Tangram’s DP assigns layers proportionally to each island’s compute.

Algorithmic Complexity Analysis

GPU island formation: O(N2)O(N^2) node comparisons for initial clustering (NN = number of GPU nodes), O(K2)O(K^2) merge operations. In practice N1000N \leq 1000 and K5K \leq 5, so this is negligible.

Enumeration before pruning: O((LK1))O(\binom{L}{K-1}) slice combinations (choosing K1K-1 split points in LL layers). For L=80,K=3L=80, K=3: (792)=3081\binom{79}{2} = 3081 combinations per island permutation.

After Rule 1 (dedup): O(LK)O(L \cdot K) unique (layer-count, island) pairs — O(80×3)=240O(80 \times 3) = 240.

Parallelizer calls: 240 calls to the homogeneous parallelizer. Each call takes ~30–300 seconds for Aceso-style beam search. Total: 2–20 hours for planning. This is dominated by the parallelizer, not Tangram’s own overhead.

DP composition: O(L2K)=O(802×3)=O(19,200)O(L^2 K) = O(80^2 \times 3) = O(19,200) — under 1 second.

The planning-time bottleneck is the parallelizer calls, not Tangram’s algorithm. Caching partial plans (reusable across training runs with the same model and cluster) would reduce this to negligible overhead for repeated use.

Critical Assessment: Weaknesses & Improvements

Weaknesses and Flaws

1. Static one-shot planning with no fault tolerance. Tangram generates a single plan before training starts and that plan is immutable for the entire run. In practice, multi-week training runs on hundreds of GPUs experience GPU failures, node replacements, and topology changes. A single GPU failure in a pipeline stage halts training entirely, requiring manual checkpoint recovery and plan regeneration. The paper’s §6 explicitly notes that fault tolerance is out of scope. This is a significant gap for production deployability.

2. Pipeline throughput model ignores compute-communication overlap. Equation (7) models the pipeline cycle time assuming stage kk‘s compute and activation transfer are non-overlapping. In the standard 1F1B (one-forward-one-backward) schedule used by Megatron, the pipeline stage actually overlaps computation with communication using double-buffering: while stage kk computes micro-batch mm‘s activations, it simultaneously sends micro-batch m1m-1‘s activations to stage k+1k+1. This overlap can reduce effective pipeline latency by up to 2×2\times for communication-compute balanced stages. Tangram’s optimizer therefore underestimates throughput for configurations with fast inter-node links, and may not find the globally optimal plan.

3. Only decoder-only, uniform-architecture models are tested. The model-slice interface assumes all layers in a slice are structurally identical (same dd, same attention mechanism, no cross-attention). This holds for LLaMA, Mistral, and Qwen dense decoders but breaks for encoder-decoder models (T5, BART) and hybrid architectures. The paper does not discuss this limitation, which could mislead readers into thinking Tangram works universally.

4. The 2026 heterogeneous baselines are missing from experiments. Table 1 lists five 2026 heterogeneous parallelizers (HARP, HetAuto, HexiScale, and two others). The experimental section compares only against Metis (2024) and Sailor (2025). Among the 2026 systems, HARP supports expert parallelism and uses a similar XLA backend to Metis — it is the most direct competitor to Tangram’s core claim (“first heterogeneous parallelizer with full feature support”). Without this comparison, the paper’s positioning is incomplete.

5. Intra-island heterogeneity is silently absorbed. The greedy merge step can group GPUs with up to ~20% compute difference into the same island (threshold τC=0.2\tau_C = 0.2). The homogeneous parallelizer plans assuming uniform performance, but the slowest GPU within the island becomes a bottleneck. For a 20% intra-island compute spread, the effective throughput could be up to 20% lower than the plan estimates. The paper’s experiments use real hardware with presumably tight tolerances, but real-world clusters often have higher variance (Boost Clock, thermal throttling). No characterization of this effect is provided.

Limitations the Authors Understate

The paper presents the GPU island abstraction as broadly applicable, but it silently depends on the following structural assumptions that are not stated as limitations:

  1. Layer-uniform models: All transformer layers between islands must have identical input/output shapes. If a model has variable hidden dimensions across layers (as some recent architectures with “width scaling” do), the slice interface breaks.

  2. Independent island optimization: The DP composition assumes that the optimal partial plan for a given (slice, island) pair is independent of what the other islands do. This ignores interactions: for example, if island A uses DP=8, it produces 8× replicated gradients that need AllReduce with the other DP replicas — possibly spanning islands. The paper does not describe how cross-island DP gradient synchronization is handled.

  3. Fixed micro-batch count: The throughput model in Equation (7) treats nmicron_{\text{micro}} as a fixed parameter rather than a joint optimization variable. In practice, the optimal nmicron_{\text{micro}} depends on the stage latency profile — for a highly unbalanced pipeline, more micro-batches can dilute the bubble ratio. Tangram does not optimize nmicron_{\text{micro}} jointly with the plan.

Concrete Improvement Suggestions

A. Dynamic re-planning on GPU failure. Cache all partial plans after the initial planning phase. When a GPU fails, mark the affected island’s (slice, island) pairs as invalid and re-run only the DP composition step — which takes under 1 second. This reduces recovery from hours (full replan) to seconds. Combined with checkpoint-resume, it would make Tangram viable for production long-running jobs.

B. Compute-communication overlap in the throughput model. Incorporate the standard double-buffering overlap model from Megatron’s 1F1B schedule into Equation (7). The corrected model is:

Tcycleoverlap=(nmicro1)τmax+kmax(τkcompute,τkcomm)(14)T_{\text{cycle}}^{\text{overlap}} = (n_{\text{micro}} - 1) \cdot \tau_{\max} + \sum_k \max(\tau_k^{\text{compute}}, \tau_k^{\text{comm}}) \tag{14}

This would improve plan quality particularly for configurations where inter-island bandwidth is high (e.g., H100 → H200 via 400 Gb/s InfiniBand).

C. Joint optimization of nmicron_{\text{micro}}. Modify the DP to jointly optimize the number of micro-batches nmicron_{\text{micro}}. For each candidate pipeline assignment, compute the optimal nmicron_{\text{micro}} by solving:

nmicro=argmaxn  Throughput(n,τ0,,τK1)(15)n_{\text{micro}}^* = \arg\max_{n} \;\text{Throughput}(n, \tau_0, \ldots, \tau_{K-1}) \tag{15}

This is a single-variable optimization that can be solved exactly in O(1)O(1) given the stage latencies.

D. Broader evaluation set. Include HARP, HetAuto, HexiScale as baselines; add encoder-decoder models (T5-11B); test extreme heterogeneity (A10 + H100, ~10× compute ratio) to characterize where Tangram’s island formation breaks down.

E. Sensitivity analysis for island thresholds. Provide a figure showing how final training throughput varies as τC\tau_C is swept from 0.05 to 0.4. This would give practitioners actionable guidance on how to configure island formation for their specific cluster.

Connections to Prior Work

Tangram synthesizes three separate research traditions:

1. Automatic parallelizers (Alpa, Aceso, Galvatron, Mist, HyperTron): Tangram wraps these rather than replacing them. Every improvement to a homogeneous parallelizer — better search algorithms, new parallelism dimensions, improved cost models — automatically benefits Tangram. The relationship is composable by design.

2. Heterogeneous cluster management (Metis, Sailor, HetAuto, HARP, HexiScale): These systems take the “expand the search space” approach: enumerate over GPU placements jointly with parallelism decisions. Tangram takes the “factorize the problem” approach: separate the two concerns and compose using DP. The trade-off is that Tangram restricts inter-island communication to pipeline boundaries, while joint-optimization systems can in principle use cross-GPU TP (though they rarely do in practice due to bandwidth constraints).

3. Pipeline scheduling (GPipe, PipeDream, Megatron 1F1B, PipeDream-Flush): Tangram’s DP composition is structurally similar to GPipe’s pipeline schedule optimization, but operating over heterogeneous stages rather than identical ones. The bubble ratio analysis (Equation 9) is standard pipeline parallel theory applied to a novel context.

Future Directions and Open Problems

Tangram opens several interesting directions for follow-up work:

Direction 1: Online replanning with checkpointing. Extend Tangram’s planning layer with a lightweight monitor that detects GPU degradation or failure and triggers incremental replanning. The DP composition can be re-run in under a second; the bottleneck is re-invoking the parallelizer for affected (slice, island) pairs. A checkpoint mechanism that periodically saves model state in a format compatible with any valid re-planned configuration (not just the current one) would allow near-seamless recovery.

Direction 2: Learned cost models for faster planning. The 30–300 second per-call overhead of invoking Aceso or Galvatron is dominated by profiling (running the plan on actual hardware for a few steps). Replacing hardware profiling with a learned latency predictor (as in recent works like PrimeTuner or Habitat) could reduce per-call cost to milliseconds, making the planning phase negligible even for large LL and KK.

Direction 3: Multi-objective optimization. Tangram maximizes training throughput subject to a memory constraint. Future extensions could optimize additional objectives jointly: (a) minimize total energy consumption (power × time), which matters for datacenter operators; (b) minimize monetary cost for cloud training (different GPU types have different spot prices); (c) satisfy SLO constraints (train for at most NN days) while minimizing cost. The DP framework generalizes naturally to multi-objective optimization with Pareto-front tracking.

Direction 4: Extension to inference serving. The GPU island abstraction is equally applicable to multi-model inference serving on heterogeneous clusters. A serving system could use Tangram-style composition to assign different layers of a large model to different GPU types, with PP carrying requests through the pipeline. This extends recent work on serving disaggregation (DistServe, Llumnix) to heterogeneous hardware.

Reproducibility Notes

  • Code: Promised in the abstract as available at an anonymized GitHub repo. Not yet public as of the arXiv preprint date (June 2026). Open-source release expected upon de-anonymization.
  • Models tested: LLaMA-style 70B (architecture specified in §6.1), Mixtral-8x22B (open-weights via HuggingFace), Qwen3-MoE (open-weights via HuggingFace).
  • Hardware: ETH Zürich CSCS cluster with real A100 and H100 SXM nodes. Large-scale results supplement with simulation validated against real-hardware measurements.
  • Baselines: Alpa (GitHub: alpa-projects/alpa), Aceso (GitHub: PKU-DAIR/Hetu), Metis (GitHub: uiuc-kang-lab/iterative-transformer-xl), Sailor (GitHub: ETH-DISCO/sailor). All publicly available.
  • Reproducibility verdict: Moderate. Real-hardware reproduction requires access to a heterogeneous GPU cluster. The core algorithms (island formation, DP composition) are clearly specified and can be implemented independently. Partial reproduction with simulation should be feasible once the code is released.

GPU Generation Characteristics Reference

For context, here is the progression of NVIDIA data-center GPUs relevant to Tangram’s experimental setting:

GPUGenerationBF16 TFLOP/sHBM (GB)NVLink BWTypical Release
A100 SXMAmpere31280600 GB/s2020
H100 SXMHopper98980900 GB/s2022
H100 NVLHopper98994900 GB/s2023
H200 SXMHopper+HBM3e1979141900 GB/s2024
B200 SXMBlackwell~45001921800 GB/s2025

Note that H100 and H100 NVL differ only in HBM capacity (15% difference) — Tangram’s greedy merging would combine them into a single island with effective HBM = 80 GB. A100 vs. H100 is a 3.2× compute gap; H100 vs. H200 is a 2× gap; A100 vs. H200 is a 6.3× gap. For the extreme case (A100 + H200), Tangram’s balanced assignment would give A100 ~14% of layers — a very thin stage that is unlikely to benefit from sophisticated parallelism.

Summary

Tangram identifies and solves a real-world problem that will only grow in importance: heterogeneous GPU clusters are the norm in modern ML infrastructure, but the best auto-parallelizers assume homogeneity and existing heterogeneous alternatives sacrifice too many features to be competitive.

The core contribution — decouple parallelization planning from GPU heterogeneity via island abstraction and DP pipeline composition — is elegant and well-executed. The 2.3× throughput gain over heterogeneous baselines (Metis, Sailor) and the 26× pruning speedup are robust across model sizes and cluster configurations. The MoE experiment demonstrates a particularly important differentiator: retaining expert parallelism on heterogeneous clusters, which current systems cannot do.

The main gaps are fault tolerance (critical for production), communication-overlap modeling (affects plan optimality), and a missing comparison against 2026 heterogeneous baselines. But as a clean architectural layer that makes the best homogeneous parallelizers usable on heterogeneous hardware, Tangram is a meaningful contribution to the distributed LLM training stack.

The paper’s title — “Hiding GPU Heterogeneity” — captures the philosophy precisely. Rather than exposing GPU heterogeneity as a first-class problem to be solved jointly with parallelization, Tangram hides it: the inner parallelizer never sees it, the outer DP layer handles it cleanly through pipeline composition, and the user sees a system that just works on their mixed-generation cluster. This “hide complexity behind an abstraction” approach is a hallmark of good systems design, and Tangram executes it well.


Failure Mode Analysis: When Tangram’s Assumptions Break

Understanding where Tangram fails is as important as understanding where it succeeds. Four concrete failure modes:

Failure 1: Extreme compute heterogeneity (e.g., A10 + H100). The A10 GPU delivers ~125 TFLOP/s BF16, the H100 ~989 TFLOP/s — a 7.9× ratio. Tangram would form two islands and the balanced layer assignment would give Island_A10 roughly 11% of layers and Island_H100 roughly 89%. For an 80-layer model, Island_A10 gets ~9 layers. With only 9 layers, there are very few valid DP plans (small model slice, limited parallelism opportunities), and the stage throughput is dominated by the A10’s low performance anyway. The composed pipeline achieves little more than the throughput of the A10 stage alone. In this regime, simply excluding the A10 cluster from the job and using only H100s might be more efficient — Tangram does not help.

Failure 2: Highly irregular model architectures. A model with a 16,384-dimensional embedding layer followed by 4,096-dimensional transformer layers cannot be uniformly sliced at layer boundaries — the embedding and projection layers have different shapes, breaking the interface compatibility assumption. Tangram’s current implementation requires uniform hidden dimensions across all sliceable layers.

Failure 3: Very small models. For a 7B model on 512 GPUs, the model is small relative to the cluster, and the dominant parallelism is data parallelism with minimal model sharding. Pipeline parallelism has high bubble overhead (too few micro-batches to amortize the pipeline fill/drain). Tangram’s DP will produce a plan with K=1K=1 island (the full cluster as one stage), falling back to a standard homogeneous plan.

Failure 4: Planning time exceeds training time. For a one-time short training run (fine-tuning for a few hundred steps), the planning overhead (30 minutes to several hours for Aceso invocations) may exceed the training time itself. Tangram is designed for long pre-training runs where planning time is amortized over days or weeks of training.

Implementation Notes: Practical Considerations

Handling Cross-Island Data Parallelism

One aspect the paper handles but does not fully detail is how data parallelism is organized across island boundaries. In Tangram’s composed pipeline, each stage can independently use DP internally. The DP degree within a stage determines how many replicas of that stage run in parallel.

When two stages on different islands both use DP, their DP groups must synchronize gradients at the end of each training step. If stage A on Island 0 uses DP=8 and stage B on Island 1 uses DP=4, the DP groups do not align naturally: the 8 replicas of stage A need to sync with corresponding replicas of stage B. Tangram resolves this by requiring that the DP degrees across stages be consistent (all stages use the same total DP degree), ensuring that each DP replica contains exactly one copy of the full pipeline. This is a standard pipeline parallelism requirement and does not add new complexity, but it does constrain the DP degree selection within stages.

Interaction with Gradient Checkpointing Strategies

Activation recomputation (RC) interacts with the pipeline schedule in a subtle way. In Megatron’s 1F1B schedule, a stage must hold activations for all currently-in-flight micro-batches. For nmicron_{\text{micro}} micro-batches and nstagesn_{\text{stages}} stages, each stage holds at most nstagesn_{\text{stages}} micro-batches’ activations simultaneously. With full activation recomputation, this memory does not grow with nmicron_{\text{micro}}; without RC, activation memory scales linearly with nmicron_{\text{micro}}.

Tangram’s memory feasibility check (Rule 3, Equation 6) uses a simplified model that does not account for this pipeline depth dependency. For deep pipelines (large KK), the actual activation memory per stage may exceed the estimate, causing unexpected OOM at runtime. In practice, the paper’s experiments use K=2K = 2 or K=3K = 3 stages, where this effect is minor, but it could become significant for K5K \geq 5.

Planning Time in Practice

On a real cluster, planning time breaks down as follows:

Total planning time ≈
  Island formation: < 1 second
  Model-slice enumeration (after pruning): O(L) unique slice counts
  Parallelizer invocations: N_pruned × T_parallelizer_per_call
  DP composition: < 1 second

Where:
  N_pruned ≈ L × K / 26 (after all pruning rules)
  T_parallelizer_per_call ≈ 30s–300s (Aceso on a 64-GPU island)

For L=80,K=3L=80, K=3, after pruning: ~9 unique (layer-count, island) pairs per island = ~27 parallelizer calls. At 60 seconds per call: 27 minutes total. This is the actual planning time bottleneck, not Tangram’s own algorithms.

Caching partial plans between training runs of the same model on the same cluster reduces re-planning to under 1 second (only the DP composition needs to run again if, say, one island gains or loses nodes). This is a key operational benefit that the paper mentions but does not fully emphasize.

Why Heterogeneous Clusters Are Getting More Common

It is worth stepping back to understand why the problem Tangram solves is growing in importance, not shrinking.

Capital expenditure amortization: A100 nodes purchased in 2022 at $10,000 per GPU represent a multi-hundred-million-dollar investment at scale. Writing off these assets before they reach end-of-life (typically 5–7 years) is financially unjustifiable. Organizations will continue running mixed A100/H100/H200 clusters for years.

GPU availability constraints: In 2025–2026, demand for H200 and Blackwell GPUs exceeded supply. Organizations that needed to scale computing could not simply replace their fleet — they had to expand heterogeneously, adding new nodes alongside existing ones.

Spot instance markets: Cloud providers (AWS, GCP, Azure) offer spot/preemptible GPU instances at significant discounts, but availability varies by instance type. Training jobs that can use a mix of GPU types have a much larger pool of available capacity, reducing costs and wait times.

Research cluster sharing: Academic clusters (e.g., university HPC clusters) are built up over years through grants and donations, resulting in inherent heterogeneity. Research groups training large models on these clusters have no choice but to work with what is available.

Tangram’s contribution — enabling state-of-the-art auto-parallelization on heterogeneous clusters without sacrificing features — therefore addresses a problem that is likely to persist and grow, not a temporary edge case.

Worked Example: End-to-End Planning for a 70B Model

To make the algorithm concrete, here is a step-by-step trace of Tangram planning a 70B LLaMA model on a heterogeneous cluster with 64 H100 SXM + 128 A100 nodes.

Step 1: GPU island formation

  • Compute difference: 989312/989=68%|989 - 312| / 989 = 68\%, exceeds τC=0.2\tau_C = 0.2.
  • Two islands: I0I_0 = {64 H100 SXM}, I1I_1 = {128 A100}.
  • No greedy merging (compute difference too large).

Step 2: Balanced allocation estimate

  • I0I_0 compute: 64×989=63,29664 \times 989 = 63,296 TFLOP/s.
  • I1I_1 compute: 128×312=39,936128 \times 312 = 39,936 TFLOP/s.
  • Fraction for I0I_0: 63296/(63296+39936)=61.3%63296 / (63296 + 39936) = 61.3\%.
  • Balanced layer assignment: S0=80×0.613=49.0|S_0|^* = 80 \times 0.613 = 49.0 layers; S1=31.0|S_1|^* = 31.0 layers.
  • With β=2\beta = 2: I0I_0 accepts 25–80 layers; I1I_1 accepts 16–62 layers.

Step 3: Pruned enumeration

After Rule 1 (dedup), unique layer counts: 1, 2, …, 80 for each island — but Rule 2 restricts I0I_0 to 25–80 and I1I_1 to 16–62. Valid pairs where S0+S1=80|S_0| + |S_1| = 80:

  • S0|S_0| in [25,64][25, 64] (since S1=80S0|S_1| = 80 - |S_0| must also satisfy [16,62][16, 62]): valid range is S0[25,64]|S_0| \in [25, 64] → 40 unique pairs.

After Rules 3 and 4, suppose 35 remain. Tangram invokes the parallelizer 35 times (vs. 802/2=320080^2/2 = 3200 without pruning — ~91× reduction for this 2-island case).

Step 4: DP composition

For each of the 35 valid (slice, island) combinations, the parallelizer returns a latency estimate τ(p)\tau(p). The DP finds the split point ll^* that minimizes τ0τ1|τ_0 - \tau_1| (stage balance), which for the 63/37 compute split turns out to be around l=48l^* = 48 layers for Island 0 and 32 layers for Island 1 — close to the balanced estimate of 49/31.

Result: Two-stage pipeline with ~5% bubble ratio, full TP + ZeRO + RC support within each stage, ~2.1× over Aceso on this cluster configuration.

Reviewer’s Final Assessment

Tangram addresses a real and growing pain point. As GPU procurement increasingly involves mixing generations — buying H100s while still running A100s, adding L40S nodes for cost efficiency, or onboarding spot instances of heterogeneous types — homogeneous-only parallelizers become a significant operational bottleneck. The paper’s answer is both principled and pragmatic: don’t retrain the optimizer, just add an island abstraction layer above it and compose with DP.

The 26× search space reduction via four pruning rules is the highlight of the algorithmic contribution. Many systems papers gloss over this, but Tangram’s pruning is load-bearing: without it, O(L²) enumeration would be impractical for 100-layer models. The rules are intuitive (dedup, load balance, memory feasibility, interface compatibility) and together reduce a quadratic space to something the inner parallelizer can handle in reasonable wall-clock time.

From an implementation perspective, the decision to rely on the inner parallelizer as a black box is both a strength and a constraint. It allows Tangram to leverage years of engineering invested in tools like Alpa and Aceso, but it also means Tangram inherits their blind spots — particularly around communication latency modeling and cross-stage synchronization costs. Future iterations that expose a richer cost model interface to the inner parallelizer could significantly improve optimality without sacrificing composability.

The evaluation on 64-GPU heterogeneous clusters is convincing for the stated use case. The ~1.9–2.3× gains over baselines, and the near-parity with homogeneous oracles, validate the design. The main gap — fault tolerance and checkpoint restart under partial GPU failures — remains open, and the authors acknowledge it. For production deployments at scale, this will matter.

Overall, Tangram is a solid systems contribution: a clean abstraction, a well-analyzed search space, and empirically verified gains. It deserves attention from practitioners managing mixed GPU fleets in 2026.


Review by Zhongzhu Zhou — 2026-07-02