May 18, 2026 EN #LLM Agent #Multi-Agent Systems #Reasoning

Why Single-Agent LLMs Beat Multi-Agent Systems on Multi-Hop Reasoning — A Budget-Controlled Story

Review date: 2026-05-18
Author: Zhongzhu Zhou
Paper: Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
Authors: Dat Tran, Douwe Kiela (Stanford University)
arXiv: 2604.02460v2, revised 2026-04-11
Venue: Preprint, under review

Short answer

This paper makes a surprisingly simple point that has been quietly true for a while and is finally written down carefully: once you fix the number of “thinking tokens” a system is allowed to spend on a question, single-agent LLMs (SAS) match or beat multi-agent systems (MAS) on multi-hop reasoning. The headline single-agent win is propped up by two non-obvious things — an information-theoretic argument based on the Data Processing Inequality (DPI), and a careful audit of how reasoning-token budgets are actually accounted for in practice (especially in the Gemini API, which turns out to under-spend its declared budget in single-agent mode and silently inflate MAS).

I want to spend most of this review separating three threads that the paper deliberately keeps apart, because they are usually conflated in the literature:

Architectural claim: MAS introduces extra communication steps. By DPI, those steps can only reduce or preserve the mutual information available about the answer. So under a fixed thinking-token (not compute, not latency) budget and with perfect context utilization by SAS, the Bayes error of SAS is provably no worse than MAS.
Empirical claim: Across Qwen3-30B-A3B, DeepSeek-R1-Distill-Llama-70B, Gemini-2.5-Flash and Gemini-2.5-Pro, on FRAMES (multi-hop world knowledge) and MuSiQue 4-hop, SAS (with or without a “longer-thinking” scaffold called SAS-L) consistently matches or beats the best MAS variant (Sequential, Subtask-parallel, Parallel-roles, Debate, Ensemble) at every thinking-token budget from 500 up to 10,000 tokens. The one exception is the 100-token budget, where nobody is really thinking.
Methodological claim: MAS’s apparent wins in earlier literature are largely measurement artifacts: either the MAS quietly uses more thinking tokens (API budget caps that don’t actually cap), or the benchmark suffers from memorization that the paraphrasing ablation exposes, or the MAS’s “more breadth” rescues a small subset of questions but loses more by drifting.

The DPI argument is not new in spirit, but it is the first place I have seen it written down cleanly for the SAS vs MAS question, and the experimental matrix is generous enough that I trust the headline finding. The paper also identifies the boundary of the SAS-wins regime: under heavy substitution or masking of context (corrupting information rather than just shortening it), Sequential MAS does become competitive and eventually wins. This is what you would predict from the same DPI argument with a degraded effective context, and it is a satisfying tightening of an otherwise pretty bold claim.

If you build agentic systems for a living, the operational takeaway is sharp: stop using MAS as the default for multi-hop reasoning. Use SAS as the baseline, only fall back to MAS when (a) you have measured that single-agent context utilization is degraded, or (b) the task structure forces independence/specialization that no single agent can keep in its working state. The budget-controlled comparison is the only fair one.

1. Prerequisites

This review aims to be useful for someone who has built an LLM application but has not yet read carefully through the MAS-vs-SAS debate. Skim if you already know all of: DPI, “thinking-token” budgets in modern reasoning APIs, the FRAMES / MuSiQue benchmarks, and the standard MAS taxonomy (debate, ensemble, sequential planner-worker-aggregator, role specialization).

1.1 What a “thinking token” actually is

Modern reasoning-capable models (OpenAI o-series, Gemini 2.5 Flash/Pro, DeepSeek-R1, Qwen3 in thinking mode) generate two streams: a private “thinking” or “scratchpad” trajectory, then a public answer. The number of tokens spent on the private trajectory is the thinking-token budget. In Gemini 2.5 this is exposed as thinking_budget; in OpenAI o-series as reasoning_effort (low/medium/high); in vLLM/Qwen3 via stop conditions or post-hoc truncation of the <think> block.

The paper defines budget $B$ as the total tokens used for intermediate reasoning (the <think>...</think> block for open-source models, the corresponding API field for Gemini), not including the system prompt, user message, or final answer. Crucially, when comparing SAS with budget $B$ to a Sequential MAS with $k$ workers, the paper splits the budget so each worker gets $B/k$ tokens.

This is the only comparison that is even arguably fair. If a debate has two debaters and an aggregator and you give each $B$ tokens, the debate is using $\approx 3B$ tokens and you should not be surprised that it does well.
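To make the accounting concrete, here is a minimal sketch of the two bookkeeping schemes. The function names are illustrative and not taken from the paper; the even integer split is an assumption about how a remainder would be handled.

```python
def naive_total(budget: int, num_calls: int) -> int:
    """Total thinking tokens when every call is given the full budget B."""
    return budget * num_calls


def matched_share(budget: int, num_workers: int) -> int:
    """Per-worker thinking budget when B is split evenly (integer floor)."""
    if num_workers < 1:
        raise ValueError("need at least one worker")
    return budget // num_workers


B = 1000
print(naive_total(B, 3))    # two debaters + aggregator, each at B -> 3000
print(matched_share(B, 4))  # four workers sharing B -> 250
```

Under the matched scheme the sum over workers never exceeds `B`, which is the property the comparison relies on.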

1.2 The Data Processing Inequality (DPI), in one paragraph

Suppose the truth is a random variable $Y$, your full context is $C$, and a messaging function $g$ produces an inter-agent message $M = g(C)$. Then $Y \leftrightarrow C \leftrightarrow M$ is a Markov chain (because $M$ depends on $Y$ only through $C$), and DPI says

$I(Y;C) \geq I(Y;M).$

Equivalently, $H(Y\mid M) \geq H(Y\mid C)$: residual uncertainty about $Y$ after observing $M$ is at least as large as after observing $C$. No transformation of $C$ can extract more mutual information about $Y$ than $C$ itself contained. This is a textbook fact (Cover & Thomas, Ch. 2) but it has a powerful consequence: any agent whose decision is conditioned on $M$ rather than $C$ is, in the best case, no better than one conditioned on $C$. The minimal achievable error probability satisfies $P_e(C) \leq P_e(M)$.
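The inequality is easy to check numerically on a toy example. The sketch below uses a made-up joint distribution over a binary $Y$ and a four-valued $C$, with a deterministic lossy message `g`; none of the numbers come from the paper.

```python
import math
from collections import defaultdict


def mutual_information(joint: dict) -> float:
    """I(A;B) in bits from a joint distribution {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)


def push_forward(joint_yc: dict, g) -> dict:
    """Joint of (Y, M) where M = g(C) is a deterministic message."""
    joint_ym = defaultdict(float)
    for (y, c), p in joint_yc.items():
        joint_ym[(y, g(c))] += p
    return dict(joint_ym)


# Toy joint: Y is a bit, C is a 4-valued context that is informative about Y.
joint_yc = {(0, 0): 0.35, (0, 1): 0.10, (0, 2): 0.04, (0, 3): 0.01,
            (1, 0): 0.01, (1, 1): 0.04, (1, 2): 0.10, (1, 3): 0.35}


def g(c: int) -> int:
    """Lossy message: keeps only the coarse half of C."""
    return c // 2


i_yc = mutual_information(joint_yc)
i_ym = mutual_information(push_forward(joint_yc, g))
print(f"I(Y;C) = {i_yc:.3f} bits, I(Y;M) = {i_ym:.3f} bits")
```

Replacing `g` with the identity makes the two quantities equal, which is the sufficient-statistic case.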

What this argument does not say:

It does not say SAS is strictly better than MAS — equality is possible if $g$ is a sufficient statistic.
It does not say SAS will actually achieve $P_e(C)$ . Real LLMs are far from Bayes-optimal estimators; they suffer from “lost-in-the-middle” effects, attention dilution, positional bias, and context rot. The argument assumes perfect context utilization.
It does not account for compute. Multi-agent rounds spend more wall-clock and more total tokens. If the comparison is “more tokens with MAS vs fewer with SAS”, you are not measuring architecture, you are measuring compute.

The paper’s contribution at the theory level is to identify these conditions precisely, then bake them into the experimental design. The fix is to control $B$ and to allocate it identically across architectures.

1.3 FRAMES and MuSiQue — what kind of reasoning we are testing

Both are multi-hop world-knowledge QA datasets.

FRAMES (Krishna et al., 2025): questions are explicit multi-hop fact lookups with hand-written ground truths. Example: “Who wrote the song that was the encore of the artist who won the Grammy for Best New Artist in 2010?” The answer has a single canonical form.
MuSiQue (Trivedi et al., 2022), filtered to 4-hop: composed from 4 single-hop questions that share an entity bridge. Example structure: $A \to B \to C \to D \to E$ , ask “What is the $E$ of the $D$ of the $C$ of the $B$ of $A$ ?” The original MuSiQue paper showed that LLMs of the day were brittle at $\geq 3$ hops. 4-hop is genuinely hard, and even Gemini-2.5-Pro tops out around 0.45 accuracy in this paper.

A judge model scores each prediction by checking whether the gold answer is semantically present. The same rubric is used for SAS and all MAS variants, so any accuracy difference is attributable to the system, not the judge.

1.4 The MAS taxonomy used in the paper

The paper instantiates five concrete MAS designs, all under matched budget $B$ :

Architecture	Decomposition	Communication topology
Sequential	Planner → ordered workers → aggregator	Linear, each step sees prior outputs
Subtask-parallel	Planner → independent workers → aggregator	Star, workers don’t see each other
Parallel-roles	Solver / Fact Extractor / Skeptic / Second Solver → aggregator	Star with role specialization
Debate	Two debaters → critique round → judge	Bipartite with critique
Ensemble	Multiple temperature-0.7 candidates → judge	Pure majority/judge selection

Sequential is highlighted as the cleanest analogue of SAS, because both are serial reasoning over an evolving trajectory. The only difference is whether intermediate states are latent in a single chain (SAS) or externalized as messages between steps (Sequential MAS). This is the central architectural comparison.

1.5 What “single-agent with longer thinking” (SAS-L) means

To make SAS push harder on the budget, the authors add a small prefix that asks the model to:

Identify ambiguities,
Propose at least two interpretations,
Evaluate and choose one,
Then answer.

The budget $B$ is unchanged — only the user prompt is augmented. This is intended to elicit more visible thinking text, especially for Gemini, where SAS’s emitted scratchpad has been observed to plateau well below the requested budget. SAS-L matters less for Qwen3 and DeepSeek, because their <think> blocks reliably fill their budget.

2. The theory: a clean DPI for SAS vs MAS

The theoretical core (§3 of the paper) is two lemmas chained together.

2.1 Lemma 1: SAS is information-theoretically no worse than MAS under perfect context utilization

The setup is what I sketched in §1.2: $Y \leftrightarrow C \leftrightarrow M$ , where $M = g(C)$ is whatever the MAS communication channel produces. The argument:

Any estimator $\delta_M : M \to \hat{Y}$ used by MAS induces an estimator $\delta_C^{\delta_M} : C \to \hat{Y}$ via

$\delta_C^{\delta_M}(\hat{y}\mid c) = \sum_m q(m\mid c) \cdot \delta_M(\hat{y}\mid m).$

In words: simulate the MAS’s message-generation channel, then apply the same downstream rule. This induced estimator has identical joint distribution over $(Y, \hat{Y})$ as the original MAS pipeline.
The induced estimator lives in $\mathcal{D}_C$ (the set of all randomized estimators that observe $C$ ). So

$P_e(C) = \inf_{\delta \in \mathcal{D}_C} \Pr[\hat{Y}_\delta \neq Y] \leq \Pr[\hat{Y}_{\delta_C^{\delta_M}} \neq Y] = \Pr[\hat{Y}_{\delta_M} \neq Y] = P_e(M).$
Therefore $P_e(C) \leq P_e(M)$ . The single-agent system, with access to the full context $C$ , can be no worse than any MAS that operates on $M$ .
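The same ordering can be checked for Bayes error with a stochastic message channel $q(m\mid c)$. This is a toy sketch with invented numbers: the joint distribution and the 10% channel noise are assumptions chosen so the arithmetic is easy to follow.

```python
from collections import defaultdict


def bayes_error(joint: dict) -> float:
    """Minimal P(guess != Y) when guessing Y from X; joint = {(y, x): p}."""
    by_x = defaultdict(lambda: defaultdict(float))
    for (y, x), p in joint.items():
        by_x[x][y] += p
    # The Bayes rule picks the most likely y for each observed x.
    return 1.0 - sum(max(col.values()) for col in by_x.values())


def through_channel(joint_yc: dict, channel: dict) -> dict:
    """Joint of (Y, M) when M is drawn from q(m | c); channel = {c: {m: q}}."""
    joint_ym = defaultdict(float)
    for (y, c), p in joint_yc.items():
        for m, q in channel[c].items():
            joint_ym[(y, m)] += p * q
    return dict(joint_ym)


joint_yc = {(0, 0): 0.35, (0, 1): 0.10, (0, 2): 0.04, (0, 3): 0.01,
            (1, 0): 0.01, (1, 1): 0.04, (1, 2): 0.10, (1, 3): 0.35}
# Message = coarse half of C, flipped with probability 0.1.
channel = {c: {c // 2: 0.9, 1 - c // 2: 0.1} for c in range(4)}

pe_c = bayes_error(joint_yc)
pe_m = bayes_error(through_channel(joint_yc, channel))
print(f"P_e(C) = {pe_c:.3f}, P_e(M) = {pe_m:.3f}")  # P_e(C) = 0.100, P_e(M) = 0.180
```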

This argument is, at this level of abstraction, almost a tautology — anything you can compute in a pipeline you can also compute in a single pass that includes the pipeline. The interesting content is at the practical level, which the paper handles next.

2.2 Lemma 2: a degraded SAS context flips the regime

Real LLMs do not utilize $C$ perfectly. The paper models this with $\tilde{C}_\alpha = T_\alpha(C)$, where $T_\alpha$ is monotone in $\alpha$ (more degradation never adds information). Ordering two degradation levels gives the Markov chain:

$Y \leftrightarrow C \leftrightarrow \tilde{C}_{\alpha_1} \leftrightarrow \tilde{C}_{\alpha_2}, \qquad 0 \leq \alpha_1 \leq \alpha_2.$

So $I(Y;\tilde{C}_{\alpha_1}) \geq I(Y;\tilde{C}_{\alpha_2})$ , and $P_e(\tilde{C}_{\alpha_1}) \leq P_e(\tilde{C}_{\alpha_2})$ .

If a MAS pipeline $g_\alpha$ extracts message $M_\alpha = g_\alpha(C)$ , the comparison is not $\tilde{C}_\alpha$ vs $C$ anymore — it is $\tilde{C}_\alpha$ vs $M_\alpha$ . Now both branches are lossy, and a sufficiently structured MAS can recover more relevant signal from the original $C$ than a degraded SAS can from $\tilde{C}_\alpha$ . The DPI bound flips direction in practice once the SAS’s effective context is bad enough.

The prediction at the experimental level: under low context degradation, SAS dominates; as degradation grows, the SAS advantage shrinks; under heavy degradation, MAS may surpass SAS. The paper verifies all three regimes in §5.3.
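A toy model makes the three regimes visible. Everything here is an assumption for illustration: the context is either intact (revealing $Y$ exactly) or fully erased with probability $\alpha$, the answer space is uniform, and the MAS message channel has a fixed error rate. The crossover point it produces is not meant to match the paper's measurements.

```python
def sas_error(alpha: float, num_answers: int) -> float:
    """Bayes error when the context is erased with probability alpha.

    Toy model: an intact context reveals Y exactly; an erased one leaves
    only a uniform prior over num_answers candidates.
    """
    return alpha * (1.0 - 1.0 / num_answers)


MAS_ERROR = 0.5    # assumed fixed error of the lossy MAS message channel
NUM_ANSWERS = 10   # assumed answer-space size

for alpha in (0.0, 0.3, 0.5, 0.7, 0.9):
    pe = sas_error(alpha, NUM_ANSWERS)
    winner = "SAS" if pe <= MAS_ERROR else "MAS"
    print(f"alpha={alpha:.1f}  P_e(SAS)={pe:.2f}  P_e(MAS)={MAS_ERROR:.2f}  {winner}")
```

In this toy the flip happens where $\alpha(1 - 1/n)$ crosses the assumed MAS error; a cleaner MAS channel moves the crossover to lower $\alpha$.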

This is the most interesting part of the theory for me, because it gives an actionable diagnostic: if your SAS is underperforming, ask whether the bottleneck is reasoning structure (in which case MAS won’t help — it lossily compresses) or context utilization (in which case MAS might help by filtering / decomposing / verifying).

3. Method and experimental design

3.1 SAS and SAS-L

The SAS pipeline is one call. System prompt: “Think step by step, then answer. Be succinct. Return only the final answer.” The model produces a <think>...</think> block (open-source) or a thoughtSummary field (Gemini), then the answer. Final-answer extraction takes whatever follows the </think> tag.
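For the open-source models, the extraction step can be sketched as below. The handling of a truncated block (an opening tag with no closing tag) is my assumption; the review does not say how the paper treats that case.

```python
def split_think_and_answer(raw: str) -> tuple[str, str]:
    """Return (thinking_text, final_answer) from a raw completion."""
    head, sep, tail = raw.partition("</think>")
    if not sep:
        if "<think>" in raw:
            # Thinking was cut off before closing: there is no answer.
            return raw.split("<think>", 1)[1].strip(), ""
        # No think block at all: the whole output is the answer.
        return "", raw.strip()
    # Tolerate a missing opening tag (some chat templates put it in the prompt).
    thinking = head.split("<think>", 1)[-1].strip()
    return thinking, tail.strip()


print(split_think_and_answer("<think>hop 1, hop 2</think> Paris"))
# ('hop 1, hop 2', 'Paris')
```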

SAS-L augments the user message with the analyze-from-multiple-perspectives scaffold described in §1.5. The thinking budget $B$ is unchanged.

3.2 Sequential MAS

Three roles:

Planner: emits strict JSON with steps $\{i, \text{name}, \text{instruction}\}$ . Receives no budget allocation in the matched-budget accounting (it is small and templated).
Worker $i$ : gets the original question, the full plan, prior step outputs, and a per-step instruction. Each worker has budget $B/k$ .
Aggregator: reads all step outputs and emits the final answer only. Also near-budget-neutral.

The matched-budget rule is: total thinking tokens across workers ≤ $B$ . The planner and aggregator are constrained to a tiny budget so they don’t add appreciable compute.
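The control flow can be sketched as follows. `call_model` is a hypothetical callable standing in for whatever inference API is used, the prompt strings are placeholders rather than the paper's prompts, and giving the aggregator a thinking budget of 0 is a simplification of the "tiny budget" described above.

```python
from typing import Callable

# Hypothetical model interface: (role, prompt, thinking_budget) -> text.
ModelFn = Callable[[str, str, int], str]


def run_sequential(question: str, steps: list[str], budget: int,
                   call_model: ModelFn) -> tuple[str, int]:
    """Planner output (`steps`) -> ordered workers -> aggregator.

    Returns the final answer and the thinking budget handed to workers.
    """
    if not steps:
        raise ValueError("planner produced no steps")
    per_worker = budget // len(steps)  # matched-budget rule: B / k
    outputs: list[str] = []
    for i, instruction in enumerate(steps, start=1):
        prompt = (f"Question: {question}\nPlan: {steps}\n"
                  f"Prior outputs: {outputs}\nStep {i}: {instruction}")
        outputs.append(call_model("worker", prompt, per_worker))
    # The aggregator only synthesizes; it gets no share of B in this sketch.
    answer = call_model("aggregator",
                        f"Question: {question}\nSteps: {outputs}", 0)
    return answer, per_worker * len(steps)


def fake_model(role: str, prompt: str, thinking_budget: int) -> str:
    """Stub that echoes its role and budget so the flow can be inspected."""
    return f"{role}@{thinking_budget}"


answer, allocated = run_sequential("Q?", ["find A", "find B", "find C"],
                                   1000, fake_model)
print(answer, allocated)  # aggregator@0 999
```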

3.3 Subtask-parallel, Parallel-roles, Debate, Ensemble

Same matched-budget rule. The interesting design choices:

Subtask-parallel: planner enforces independence. If the planner can’t find independent subtasks, this regime gives no architectural lift.
Parallel-roles: four fixed roles — Solver, Fact Extractor, Skeptic, Second Solver. Budget $B/4$ each. This is the most “specialization-heavy” of the lot.
Debate: two debaters answer, then critique each other once, then a judge picks. Each debater gets $B/2$. The critique step also counts against the budget.
Ensemble: $n$ candidates at temperature 0.7 split the budget; a temperature-0 judge picks. The interesting design choice is that Ensemble is the only one that benefits from sampling diversity rather than role decomposition.

The aggregator/judge prompts are deliberately not “solve the question yourself”; they are “pick / synthesize”. This is the right design — otherwise the aggregator becomes another SAS with extra context.

3.4 Evaluation

A separate Gemini-2.5-Flash judge takes (question, gold, prediction) and returns yes/no using a fixed rubric (“Is the substance of the gold present in the prediction?”). The same judge prompt is used everywhere, so judge-induced bias cancels across architectures.

3.5 Models and scale

Qwen3-30B-A3B (MoE with 3B active, thinking mode enabled).
DeepSeek-R1-Distill-Llama-70B.
Gemini-2.5-Flash and Gemini-2.5-Pro.
Thinking budgets: 100, 500, 1000, 2000, 5000, 10000.
Datasets: FRAMES, MuSiQue 4-hop.

That is 4 models × 6 budgets × 2 datasets × 7 architectures (SAS, SAS-L, 5 MAS) = 336 configurations, all run with bootstrap confidence intervals. This is one of the more thoroughly powered comparisons I have seen in recent agent literature.
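The grid is small enough to enumerate directly, which is a quick way to confirm the count. The string labels below are my shorthand for the paper's settings.

```python
from itertools import product

MODELS = ["Qwen3-30B-A3B", "DeepSeek-R1-Distill-Llama-70B",
          "Gemini-2.5-Flash", "Gemini-2.5-Pro"]
BUDGETS = [100, 500, 1000, 2000, 5000, 10000]
DATASETS = ["FRAMES", "MuSiQue-4hop"]
ARCHITECTURES = ["SAS", "SAS-L", "Sequential", "Subtask-parallel",
                 "Parallel-roles", "Debate", "Ensemble"]

grid = list(product(MODELS, BUDGETS, DATASETS, ARCHITECTURES))
print(len(grid))  # 336
```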

4. Headline results

4.1 Table 1 — SAS wins or ties at every non-trivial budget

The reproduced averages across all models / datasets for each architecture, at six budgets:

Budget (thinking tokens)	SAS	SAS-L	Seq	Sub	Roles	Deb	Ens
100	0.290	0.337	0.364	0.322	0.363	0.370	0.280
500	0.390	0.366	0.376	0.342	0.365	0.380	0.310
1000	0.418	0.397	0.379	0.369	0.381	0.388	0.333
2000	0.421	0.420	0.389	0.383	0.398	0.403	0.372
5000	0.427	0.425	0.386	0.396	0.417	0.420	0.411
10000	0.426	0.424	0.387	0.399	0.423	0.420	0.420

Two patterns dominate:

At 100 tokens, SAS underperforms because no architecture can actually reason in 100 tokens. The MAS planner/aggregator overhead is roughly constant across budgets and sits outside the matched budget, so at 100 tokens it is a large share of the total compute and most MAS variants appear stronger. This is a measurement artifact; 100 tokens is not enough to draw any conclusion.
From 500 tokens upward, SAS is the best or tied with the best at every single budget. The gap to the strongest MAS (Debate or Parallel-roles) is small but consistent.
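Both patterns can be read off the table mechanically. The sketch below copies the averages above and reports, per budget, the overall leader, the strongest MAS variant, and the SAS margin over that variant.

```python
COLUMNS = ["SAS", "SAS-L", "Seq", "Sub", "Roles", "Deb", "Ens"]
MAS = COLUMNS[2:]
ROWS = {
    100:   [0.290, 0.337, 0.364, 0.322, 0.363, 0.370, 0.280],
    500:   [0.390, 0.366, 0.376, 0.342, 0.365, 0.380, 0.310],
    1000:  [0.418, 0.397, 0.379, 0.369, 0.381, 0.388, 0.333],
    2000:  [0.421, 0.420, 0.389, 0.383, 0.398, 0.403, 0.372],
    5000:  [0.427, 0.425, 0.386, 0.396, 0.417, 0.420, 0.411],
    10000: [0.426, 0.424, 0.387, 0.399, 0.423, 0.420, 0.420],
}


def summarize(budget: int) -> tuple[str, str, float]:
    """Return (overall leader, best MAS variant, SAS minus best MAS)."""
    scores = dict(zip(COLUMNS, ROWS[budget]))
    leader = max(scores, key=scores.get)
    best_mas = max(MAS, key=scores.get)
    return leader, best_mas, round(scores["SAS"] - scores[best_mas], 3)


for budget in ROWS:
    print(budget, *summarize(budget))
```

On these averages the SAS margin over the best MAS variant is negative only at 100 tokens and ranges from 0.003 to 0.030 from 500 tokens upward.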

The paper reports 95% bootstrap CIs and bolds every system whose CI overlaps with the leader’s. SAS is bolded in essentially every panel at $B \geq 500$ . Even where Debate or Parallel-roles is technically the point estimate leader (e.g., Gemini-2.5-Pro FRAMES at 2000 tokens), the SAS CI overlaps.
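The bolding rule amounts to an interval-overlap test. The review does not state which bootstrap variant the paper uses, so the sketch below assumes a plain percentile bootstrap over per-question 0/1 outcomes; the example outcomes are invented.

```python
import random


def bootstrap_ci(outcomes: list[int], n_resamples: int = 2000,
                 level: float = 0.95, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for mean accuracy over 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n
                   for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def overlaps(ci_a: tuple[float, float], ci_b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Hypothetical per-question outcomes: 42/100 correct vs 39/100 correct.
sas_ci = bootstrap_ci([1] * 42 + [0] * 58)
mas_ci = bootstrap_ci([1] * 39 + [0] * 61, seed=1)
print(sas_ci, mas_ci, overlaps(sas_ci, mas_ci))
```

With only 100 questions per system, a 0.03 gap sits well inside the interval width, which is why so many cells end up "tied" under this rule.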

4.2 Token consumption

The matched-budget rule caps thinking tokens, but MAS variants often consume substantially fewer thinking tokens than they are allocated, because their planner / aggregator / worker prompts are small and the per-step <think> blocks plateau. Appendix F shows that SAS not only matches or beats MAS in accuracy, it usually does so at lower measured thinking-token cost. So on accuracy-per-thinking-token, SAS dominates even more sharply than the matched-budget plot suggests.

4.3 Gemini model-version sweep — pattern persists

§5.2 sweeps several Gemini-2.5 model versions on MuSiQue 4-hop with unlimited thinking tokens. Two stable observations:

SAS performance increases monotonically with model capability.
SAS is competitive with Sequential MAS throughout, and usually slightly better.

This rules out the “the SAS-wins pattern is a single-checkpoint artifact” objection. It is not. It is a stable structural property of the comparison.

4.4 Context degradation — the predicted regime flip

§5.3 takes Qwen3-30B-A3B on MuSiQue 4-hop at $B = 1000$ and degrades the context four ways:

Deletion: randomly remove fraction $\alpha$ of context tokens.
Masking: replace fraction $\alpha$ with a mask token.
Substitution: replace fraction $\alpha$ with random vocabulary tokens (injects misleading content).
Distractors: append $k$ topically-similar-but-irrelevant sentences.

The crossover predicted by Lemma 2 is observed in substitution and masking at $\alpha = 0.7$ : Sequential MAS overtakes SAS. Deletion shows a weaker version of the same trend. Distractors don’t flip — SAS holds up. The interpretation matches the theory perfectly: substitution and masking are the most information-corrupting degradations, and they are also the regimes where structured multi-step pipelines have the most to add via filtering / decomposing / verification.
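The four degradations are simple to state in code. This is a sketch over pre-tokenized input: the mask string, the rounding of the corrupted fraction, and the choice of vocabulary are my assumptions, not details from the paper.

```python
import random

MASK = "[MASK]"  # assumed mask token


def _pick(n: int, alpha: float, rng: random.Random) -> set[int]:
    """Indices of a random alpha-fraction of n positions."""
    return set(rng.sample(range(n), k=round(alpha * n)))


def delete(tokens: list[str], alpha: float, rng: random.Random) -> list[str]:
    drop = _pick(len(tokens), alpha, rng)
    return [t for i, t in enumerate(tokens) if i not in drop]


def mask(tokens: list[str], alpha: float, rng: random.Random) -> list[str]:
    hit = _pick(len(tokens), alpha, rng)
    return [MASK if i in hit else t for i, t in enumerate(tokens)]


def substitute(tokens: list[str], alpha: float, rng: random.Random,
               vocab: list[str]) -> list[str]:
    hit = _pick(len(tokens), alpha, rng)
    return [rng.choice(vocab) if i in hit else t for i, t in enumerate(tokens)]


def add_distractors(sentences: list[str], distractors: list[str], k: int,
                    rng: random.Random) -> list[str]:
    """Append k irrelevant sentences drawn without replacement."""
    return sentences + rng.sample(distractors, k=k)


rng = random.Random(0)
tokens = "a b c d e f g h i j".split()
print(delete(tokens, 0.7, rng))
print(mask(tokens, 0.7, rng))
```

Deletion and masking remove information, while substitution also injects wrong tokens that look like content; distractors leave every original token intact.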

This is the most satisfying part of the empirical work. It transforms the headline finding from “MAS is bad” (which is wrong) to “MAS helps in a specific, identifiable failure mode of SAS” (which is far more useful).

5. The diagnostic story behind the headline

5.1 The Gemini budget-control artifact

Appendix G is, to me, the most important methodological contribution. The authors show that Gemini 2.5 Flash and Pro under-spend their declared thinking_budget in single-agent mode: the visible thought text plateaus well below the requested budget, while in MAS, multiple calls cumulatively produce more visible thought content even at the same nominal $B$. This means a naïve comparison of SAS-at-budget-$B$ vs MAS-at-budget-$B$ in Gemini is silently giving more thinking to MAS. SAS-L exists specifically to neutralize this: once SAS is encouraged to actually use its budget, the comparison stabilizes.

This is a generalizable lesson for anyone benchmarking reasoning systems on cloud APIs: declared budget is not actual compute. You have to measure what was actually spent, not what was requested.
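A minimal way to operationalize this is to log spent thinking tokens per call and report utilization against the requested budget. The per-call counts below are invented for illustration; in practice they would come from whatever usage accounting your provider or tokenizer gives you.

```python
def utilization(requested: int, spent_per_call: list[int]) -> float:
    """Fraction of the requested thinking budget that was actually spent."""
    if requested <= 0:
        raise ValueError("requested budget must be positive")
    return sum(spent_per_call) / requested


REQUESTED = 5000
sas_spent = [1800]      # hypothetical: one call that plateaus early
mas_spent = [700] * 5   # hypothetical: five worker calls at B/5 each

print(f"SAS utilization: {utilization(REQUESTED, sas_spent):.2f}")  # 0.36
print(f"MAS utilization: {utilization(REQUESTED, mas_spent):.2f}")  # 0.70
```

If the two utilizations differ this much at the same nominal budget, the accuracy comparison is not budget-matched, whatever the request parameters say.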

5.2 The paraphrasing ablation — benchmark contamination

Appendix A applies two paraphrasing methods to MuSiQue questions:

Light paraphrase: regex-based phrase swaps (e.g., “When was” → “At what time was”), preserving the multi-hop structure.
Deep paraphrase: Gemini-2.5-Flash rewrites the question entirely while preserving meaning.

Two consistent trends:

Light paraphrase decreases SAS performance modestly. This is benchmark-style fragility, not memorization (the meaning is preserved).
Deep paraphrase increases SAS performance on Gemini-2.5-Flash (0.331 → 0.358 at 1k tokens) and is neutral or slightly positive on Qwen3.

The interpretation: original MuSiQue questions may suffer from contamination or surface-form memorization that hurts robust reasoning. Deeply rephrased questions force the model to actually reason. This is an important caveat for the entire literature that uses MuSiQue and similar benchmarks: the absolute accuracy numbers are partially confounded by surface-form effects.
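The light paraphrase is the easier of the two to reproduce. In the sketch below only the first swap comes from the example quoted above; the second rule is a made-up addition to show how a rule list would grow.

```python
import re

SWAPS = [
    (re.compile(r"\bWhen was\b"), "At what time was"),  # example from the text
    (re.compile(r"\bWho is\b"), "Which person is"),     # assumed extra rule
]


def light_paraphrase(question: str) -> str:
    """Apply surface-level phrase swaps without touching the hop structure."""
    for pattern, replacement in SWAPS:
        question = pattern.sub(replacement, question)
    return question


print(light_paraphrase("When was the author of X born?"))
# At what time was the author of X born?
```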

5.3 Error analysis — where MAS does pick up wins, and why

Table 2 partitions MuSiQue 4-hop predictions at $B = 1000$ into four buckets:

MR/SW: Sequential MAS right, SAS wrong (72 cases for Gemini, 60 for Qwen3)
SR/MW: SAS right, Sequential MAS wrong (124 for Gemini, 96 for Qwen3)
BR: both right
BW: both wrong

Three signals:

MAS wins via breadth. In MR/SW cases, Sequential MAS canvasses about 2× more distinct entities in its thoughts and the gold appears in MAS thoughts 41.7% vs 12.5% (Gemini) of the time. SAS underexplored.
SAS wins via tight anchoring. In SR/MW cases, SAS chains keep tighter lexical overlap with the question, and the gold appears in SAS thoughts 42.7% vs 18.6% (Gemini). MAS overexplored and drifted.
Extraction failure is a big chunk of MAS losses. In SR/MW, 23 Gemini cases had the gold surfaced in MAS thoughts but not extracted into the final answer. The aggregator dropped the right span.

This is the most actionable bit for system builders. The MAS architectures fail in two distinctive ways: (1) the aggregator step throws away a correct intermediate, and (2) breadth without late constraint re-checking degrades precision. Both are fixable — but they are exactly the modes where naïve MAS implementations leak accuracy.
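The four-way partition is worth running on your own systems before reading anything into aggregate accuracy. A minimal sketch, assuming you already have per-question correctness flags for both systems:

```python
from collections import Counter


def bucket(sas_correct: bool, mas_correct: bool) -> str:
    """Label one question with the bucket names used in the error analysis."""
    if sas_correct and mas_correct:
        return "BR"
    if mas_correct:
        return "MR/SW"
    if sas_correct:
        return "SR/MW"
    return "BW"


def partition(sas: list[bool], mas: list[bool]) -> Counter:
    """Count questions per bucket for two aligned correctness lists."""
    if len(sas) != len(mas):
        raise ValueError("correctness lists must be aligned")
    return Counter(bucket(s, m) for s, m in zip(sas, mas))


print(partition([True, True, False, False], [True, False, True, False]))
```

The disagreement buckets (MR/SW and SR/MW) are the ones to inspect by hand, since that is where the two architectures behave differently.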

6. Where I push back

I think the headline result is robust, but a few things are softer than the paper makes them sound.

6.1 The theory is almost vacuous unless you control compute correctly

Lemma 1 is just “you can always simulate the channel”. It does not say SAS will simulate the channel; it says the Bayes optimum allows it to. Real LLMs are bounded by their inference algorithm and by their training. Two LLMs could in principle reach $P_e(C)$ but neither does. So the theory is really a consistency check: it rules out the claim that MAS gives architectural lift independent of compute, but it doesn’t predict the magnitude of the SAS advantage in any particular system. The empirical study is doing all the work.

6.2 “Thinking tokens” is the right axis, but not the only one

The paper fixes thinking tokens. It does not fix wall-clock latency, total API tokens, or dollar cost. For deployment, the right axis depends on the constraint. If you are latency-bound (interactive chat), SAS wins by a wider margin because MAS has roundtrip overhead. If you are cost-bound (offline batch), the picture is largely the same because MAS still consumes more in absolute tokens for matched accuracy. But if you are throughput-bound at fixed quality, Ensemble with majority-vote at modest budget is sometimes competitive and gets you parallelism for free. The paper notes that Ensemble is the only architecture that becomes the best on Gemini-2.5-Pro FRAMES at $B \geq 5000$ . That is a deployment-relevant niche.

6.3 Multi-hop reasoning is not all of agents

The “multi-agent” world includes orchestration, tool use, retrieval, long-horizon planning, code generation, and embodied control. This paper is exclusively about text-only multi-hop reasoning. The DPI argument applies cleanly there because the only thing being compared is reasoning over the same fixed context. The moment you add tools — where each agent can issue distinct API calls and bring new information into the system — the equivalence between $C$ for SAS and $C$ for MAS breaks. MAS can effectively enlarge $C$ via independent tool calls; SAS cannot. So the strong DPI conclusion does not transfer to tool-using agent settings. The paper is clear about this in §C (Limitations), but readers should not over-generalize.

6.4 The 4-hop MuSiQue ceiling is low

Even Gemini-2.5-Pro tops out around 0.45 on MuSiQue 4-hop. At that accuracy, a 0.02 architecture-level difference is real but small, and 95% CIs overlap often enough that the bolding scheme makes more architectures look “tied for first” than the point estimates suggest. The Qwen3-30B and DeepSeek-R1-70B numbers are even tighter. So at any given budget the SAS advantage over the best MAS is real but typically 0.02–0.05. The take-home is “MAS isn’t worth the complexity overhead”, not “SAS is a quantum leap”.

6.5 Budget bookkeeping in Sequential is fragile

If your planner is too aggressive and emits, say, 7 steps, then each worker gets $B/7$ thinking tokens — sometimes under 200. Workers under 500 thinking tokens reliably produce thin reasoning. The Sequential MAS results may be artificially weak because the planner over-decomposes. A learned planner that adapts $k$ to question complexity (something like MAS-Orchestra, cited as Ke et al. 2026 in the references) might close the gap. The paper acknowledges this implicitly when it says “Sequential is the cleanest analogue”, but a future iteration should sweep $k$ explicitly.
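A quick sweep shows how fast the per-worker share thins out. The 500-token floor is taken from the observation above and is a rule of thumb, not a measured threshold; the function name is illustrative.

```python
MIN_USEFUL = 500  # rule-of-thumb floor for a worker's thinking budget


def thin_plan_sizes(budget: int, max_steps: int = 8) -> list[int]:
    """Plan sizes k for which each worker gets fewer than MIN_USEFUL tokens."""
    return [k for k in range(1, max_steps + 1) if budget // k < MIN_USEFUL]


for B in (1000, 2000, 5000, 10000):
    print(B, thin_plan_sizes(B))
# 1000 [3, 4, 5, 6, 7, 8]
# 2000 [5, 6, 7, 8]
# 5000 []
# 10000 []
```

At $B = 1000$ a 7-step plan leaves each worker 142 tokens, so any plan longer than two steps already falls below the floor.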

7. What I would change in a follow-up

I would do four things if I were extending this work:

Token-spent normalization. Rerun all comparisons normalizing on actual thinking tokens used, not requested. The paper reports both, but the headline plot is on requested budget. The actual-tokens plot would show an even cleaner SAS advantage.
Tool-augmented variant. Add a single-agent system with retrieval/tools vs a MAS where each agent has independent tool access. This tests the DPI escape hatch — when MAS can enlarge $C$ , the architectural gap should close or invert.
Calibrated-confidence aggregation. The SR/MW analysis shows MAS often surfaces the right answer in its thoughts but fails to extract it. Try aggregators that score candidates by calibrated confidence (e.g., self-consistency probabilities) rather than picking by judge. This should claw back some MAS losses.
Latency profile. Add wall-clock plots. For interactive deployments, even where MAS matches on accuracy, it loses on latency. This is a deployment story the paper does not tell.
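For the aggregation idea, a self-consistency vote is the simplest thing to try first. This is my sketch of that swapped-in technique, not something the paper implements, and the vote share is only a crude confidence proxy that would still need calibration on held-out data.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase, trim edge spaces/periods, and collapse inner whitespace."""
    return " ".join(answer.lower().strip(" .").split())


def vote(candidates: list[str]) -> tuple[str, float]:
    """Most frequent normalized answer and its vote share."""
    if not candidates:
        raise ValueError("no candidates")
    counts = Counter(normalize(c) for c in candidates)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(candidates)


print(vote(["Paris", "paris.", "Lyon"]))
```

A low vote share is a useful signal in itself: it marks questions where the workers disagree and where a judge-only pick is most likely to drop the right span.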

8. Reproducibility notes

The paper provides:

Full architecture prompts in Appendix D for all 7 systems (this is excellent — most MAS papers don’t).
Temperature settings: 0.7 for Ensemble, 0 elsewhere.
LLM-as-judge prompt in §D.7.
All hyperparameters at defaults except temperature.

Not provided in the body (but presumably in a release I have not seen): the code for the budget-splitting wrapper, the exact planner output JSONs for each question, and the FRAMES/MuSiQue filtering scripts. If you reproduce on Qwen3 / DeepSeek you should be able to recover their numbers within bootstrap noise; on Gemini, the API-budget artifact means you should expect SAS to under-spend its declared budget unless you also implement SAS-L. Treat any direct comparison without SAS-L on Gemini as compromised.

9. Boundary conditions — when this finding does not apply

Stated explicitly so I don’t over-generalize:

Tool-using agents: $C$ is no longer fixed across architectures; DPI argument does not transfer. Open question.
Vision/multimodal reasoning: not tested.
Safety-critical pipelines where redundancy is desirable for failure isolation: not a quality question, MAS may still be preferable.
Long-horizon planning (Voyager-style, MetaGPT-style multi-day agent work): not tested.
Tasks with explicit independent subtasks (e.g., parallel code-file edits): not in the multi-hop QA setting. Subtask-parallel may genuinely shine.
Models without thinking-mode (most pre-o1 OpenAI, pre-2.5 Gemini, smaller Llamas): budget controllability is much weaker; the comparison framework breaks down.

Inside the boundary — multi-hop world-knowledge reasoning on capable reasoning models — the finding is robust and well-supported.

10. Comparison with prior work

The paper is consistent with and extends several recent results:

Anthropic (2025) “How we built our multi-agent research system”: explicitly attributes much of the apparent MAS advantage to additional compute. The Tran & Kiela paper formalizes this with DPI and controls for it experimentally.
Wang et al. 2024, “Reasoning in Token Economies”: budget-aware evaluation shows many elaborate prompting strategies fail to outperform simple baselines once budget is matched. Tran & Kiela apply the same lens to SAS vs MAS.
Cemri et al. 2025, “Why do multi-agent LLM systems fail?”: catalog of MAS failure modes (drift, information loss, evaluation artifacts) — the error analysis in this paper instantiates the same failure types.
Kim et al. 2025, “Towards a science of scaling agent systems”: agentic benefits concentrate in weaker models and harder regimes, diminish as base models improve. Tran & Kiela’s monotonic Gemini sweep is in the same direction.
Ke et al. 2026, “MAS-Orchestra”: learned orchestration of MAS, controlled benchmarks. The natural follow-up: does learned orchestration close the budget-controlled gap?

The Tran & Kiela contribution is the combination: a clean theoretical argument, a budget-controlled empirical matrix, a sharp boundary condition (context degradation), and a methodological audit (Gemini API artifacts, MuSiQue paraphrase fragility).

11. A back-of-envelope deployment scenario

Suppose you are building a multi-hop QA system on top of Gemini-2.5-Pro. Two options:

Option A: SAS with $B = 5000$ thinking tokens. Average MuSiQue 4-hop accuracy ≈ 0.42. Cost ≈ 5000 thinking tokens × USD 5 per million tokens = USD 0.025 per question. Latency ≈ 1 API roundtrip + 5000 token generation.
Option B: Sequential MAS, planner + 5 workers + aggregator, each worker at $B/5 = 1000$. Average accuracy ≈ 0.39. Cost ≈ 5000 or more thinking tokens, plus 7 API roundtrips. Latency ≈ 7 sequential API roundtrips + up to 5000 token generation.

The accuracy delta is ~0.03 in favor of SAS. The cost delta is roughly neutral (because MAS workers under-spend budgets), but MAS pays for 7 sequential roundtrips instead of 1, which approaches a 7× latency gap when per-call overhead dominates generation time. For interactive use, SAS wins or ties on every metric. The only reason to go MAS is if you are in a context-degraded regime (long noisy contexts, retrieval-augmented contexts with low precision, or adversarial settings) where the DPI flip applies.
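The arithmetic behind the scenario fits in a few lines. The price is the one assumed in the scenario; the decode speed and per-call overhead are placeholders I made up, and the latency ratio depends heavily on them.

```python
PRICE_PER_M = 5.0  # USD per million thinking tokens (scenario assumption)


def cost_usd(thinking_tokens: int) -> float:
    """Thinking-token cost of one question."""
    return thinking_tokens * PRICE_PER_M / 1_000_000


def latency_sec(roundtrips: int, thinking_tokens: int,
                roundtrip_sec: float, tokens_per_sec: float) -> float:
    """Sequential calls: fixed overhead per call plus total decode time."""
    return roundtrips * roundtrip_sec + thinking_tokens / tokens_per_sec


print(f"cost per question: ${cost_usd(5000):.3f}")  # $0.025

# Assumed: 1 s overhead per call, 100 thinking tokens decoded per second.
sas = latency_sec(1, 5000, roundtrip_sec=1.0, tokens_per_sec=100.0)
mas = latency_sec(7, 5000, roundtrip_sec=1.0, tokens_per_sec=100.0)
print(f"SAS {sas:.0f}s vs MAS {mas:.0f}s")
```

The ratio between the two latencies grows toward the roundtrip ratio as per-call overhead (network, queueing, prompt re-processing) starts to dominate decode time.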

The paper does not write this scenario out, but the data supports it directly.

12. Practical checklist for SAS-first agent design

Based on the paper’s findings and my own analysis:

Default to SAS for multi-hop reasoning on capable reasoning models. Make MAS the fallback, not the baseline.
Measure actual thinking tokens consumed, not requested. Don’t trust API budgets blindly.
Use SAS-L (the analyze-from-multiple-perspectives prefix) when working with Gemini-2.5. It leaves the thinking budget $B$ unchanged and only lengthens the user prompt.
Profile context utilization before adopting MAS. If your context is short and clean, MAS will not help. If it is long and noisy, MAS might.
Use Debate or Parallel-roles if you must use MAS, and avoid pure Ensemble at low budgets: it has the lowest aggregate accuracy in the table at every budget up to 2000 tokens.
Audit your aggregator. The MAS extraction failures (gold in thoughts, missing in answer) suggest the aggregator is throwing away signal. Consider self-consistency or token-level confidence scoring.
Paraphrase your benchmark questions to expose memorization. If accuracy drops a lot under deep paraphrase, your benchmark is leaking surface forms.
Treat 100-token budgets as a control, not a comparison. Nothing meaningful happens below ~500 thinking tokens.

13. Closing thought

If I had to summarize this paper to a colleague in one sentence: a multi-agent LLM system is best understood as a single-agent system that has been given more compute and a worse internal communication channel; once you correct for both, the architectural lift disappears almost everywhere except in heavily degraded contexts. That is a sharp claim, it is now well-supported, and it should shape how the next generation of agent frameworks is designed and evaluated.

The next-most-useful follow-up question is the one the paper deliberately doesn’t tackle: what does the picture look like when each agent can issue independent tool calls? That’s where MAS plausibly enlarges $C$ rather than compressing it, and the DPI argument no longer favors SAS. Until that paper is written, the operational default for purely-reasoning MAS should be “don’t, unless you have measured otherwise”.

References

Tran, D., & Kiela, D. (2026). Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets. arXiv:2604.02460v2.
Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley-Interscience.
Anthropic (2025). “How we built our multi-agent research system”. Engineering blog.
Wang, J., et al. (2024). “Reasoning in token economies: budget-aware evaluation of LLM reasoning strategies”. EMNLP 2024.
Cemri, M., et al. (2025). “Why do multi-agent LLM systems fail?”. arXiv:2503.13657.
Krishna, S., et al. (2025). “Fact, fetch, and reason: a unified evaluation of retrieval-augmented generation” (FRAMES). NAACL 2025.
Trivedi, H., et al. (2022). “MuSiQue: multihop questions via single-hop question composition”. TACL.
Liu, N. F., et al. (2024). “Lost in the middle: how language models use long contexts”. TACL 12.
Du, Y., et al. (2024). “Improving factuality and reasoning in language models through multiagent debate”. ICML 2024.
Shinn, N., et al. (2023). “Reflexion: language agents with verbal reinforcement learning”. NeurIPS 2023.
Li, J., et al. (2024). “More agents is all you need”. arXiv:2402.05120.
Kim, Y., et al. (2025). “Towards a science of scaling agent systems”. arXiv:2512.08296.
Ke, Z., et al. (2026). “MAS-Orchestra: understanding and improving multi-agent reasoning”. arXiv:2601.14652.

Reviewed by Zhongzhu Zhou, 2026-05-18. Comments welcome.