200 posts in total. Keep on posting.

Showing posts 1–12 of 200. Each entry opens locally on this site; legacy Hexo posts link back to their original article at the bottom for reference.

2026

  • EN

    Tangram: Hiding GPU Heterogeneity for Efficient LLM Parallelization

    Tangram decouples parallelization planning from GPU heterogeneity by abstracting heterogeneous clusters into homogeneous GPU islands, then composing partial plans from existing parallelizers into work-balanced pipelines. It achieves up to 2.3× higher throughput than existing heterogeneous parallelizers while retaining full support for expert parallelism, ZeRO, and activation recomputation.

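  • Tangram: An Efficient LLM Parallelization System That Hides Hardware Differences in Heterogeneous GPU Clusters

    Tangram abstracts a heterogeneous GPU cluster into homogeneous GPU islands, lets existing homogeneous parallelizers generate partial plans, and then uses dynamic programming to compose them into a globally load-balanced pipeline. It retains all features such as expert parallelism, ZeRO, and activation recomputation while delivering up to 2.3× higher throughput than existing heterogeneous parallelizers.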

  • EN

    SSV: Sparse Speculative Verification for Efficient LLM Inference

    SSV resolves the structural mismatch between speculative decoding and dynamic sparse attention by grouping overlapping verifier queries, fusing NSA branches across layers, and adaptively orchestrating draft-verify strategies per prompt — achieving up to 3.49x end-to-end throughput on H100 GPUs.

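  • SSV: Sparse Speculative Verification for Speculative Decoding under Dynamic Sparse Attention

    SSV resolves the structural conflict between speculative decoding and dynamic sparse attention through overlap-aware query grouping, refresh/reuse NSA kernel fusion, and adaptive strategy orchestration, achieving up to 3.49x higher end-to-end throughput on H100 GPUs.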

  • EN

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale — Technical Review

    DAPO introduces four targeted algorithmic fixes to GRPO (asymmetric clip bounds, dynamic sampling, token-level gradient averaging, and soft overlong penalties), achieving 50% accuracy on AIME 2024 with Qwen2.5-32B in 50% fewer training steps than DeepSeek-R1-Zero.

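  • DAPO: Reading Notes on a Large-Scale LLM Reinforcement Learning System

    DAPO proposes a fix for each of four specific problems in GRPO: asymmetric clipping (Clip-Higher), dynamic sampling, a token-level policy gradient loss, and a soft overlong penalty. With these, Qwen2.5-32B reaches 50% accuracy on AIME 2024 using half the training steps of DeepSeek-R1-Zero.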

  • EN

    ACTS: Steering How LLMs Reason, Not Just How Long

    ACTS introduces an RL-trained controller agent that steers a frozen reasoning LLM step by step through a budget-aware Markov decision process, matching the accuracy of the vanilla model with up to 57% token savings and even surpassing full-thinking baselines on harder tasks by eliminating overthinking spirals.

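  • ACTS: An RL-Trained Controller That Makes LLM Reasoning Smarter, Not Just Shorter

    ACTS models the control of chain-of-thought reasoning as a budget-constrained Markov decision process and trains a lightweight controller agent to assign a reasoning strategy to a frozen reasoning model step by step, matching or even exceeding the original model's accuracy while saving up to 57% of tokens.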

  • EN

    Moebius: Seamless Runtime Parallelism Switching for MoE LLM Serving

    Moebius enables runtime switching between expert parallelism and tensor parallelism for MoE serving, completing each switch in 215-434 ms with only 2.4% memory overhead and achieving 1.16-1.25x speedup on RL rollouts.

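  • Moebius: Seamless Runtime Parallelism Strategy Switching for MoE LLM Inference Serving

    Moebius allows real-time switching between expert parallelism and tensor parallelism while serving MoE model inference. Each switch takes only 215-434 ms with just 2.4% additional memory overhead, and RL rollouts speed up by 1.16-1.25x.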

  • EN

    JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

    JetSpec resolves the causality-efficiency dilemma in speculative decoding by training a causal parallel draft head that generates all nodes of a candidate tree in one forward pass while preserving branch-wise autoregressive conditioning through a tree-causal attention mask — achieving up to 9.64x speedup on MATH-500.

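  • JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafts

    JetSpec trains a causal parallel draft head that generates every node of the speculative-decoding candidate tree in a single forward pass, while a tree-causal attention mask preserves branch-level causal conditioning. This raises the end-to-end speedup on MATH-500 to 9.64x.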