200 posts in total. Keep on posting.
Showing posts 37–48 of 200. Each entry opens locally on this site; legacy Hexo posts link back to their original article at the bottom for reference.
2026
- EN
ForeMoE: Micro-step-level MoE Load Balancing for RL Post-training via Routing Foresight
ForeMoE exploits the unique structure of RL post-training — where rollout routing decisions are replayed in later stages — to predict and proactively balance MoE expert loads at micro-step granularity, achieving up to 1.45x speedup over state-of-the-art RL training systems on 64 GPUs.
- 中
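ForeMoE: Micro-step-level MoE Load Balancing in RL Post-training via Routing Foresight
ForeMoE exploits the routing-replay structure unique to RL post-training, in which routing decisions from the rollout stage are reused in later stages, to predict MoE expert load precisely for each gradient micro-step and balance it proactively, achieving up to 1.45× end-to-end speedup on 64 GPUs.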
- EN
SliceGPT: Post-Training LLM Compression via Computational Invariance
SliceGPT exploits an exact structural symmetry in transformers to physically delete rows and columns from every weight matrix, removing 25% of the parameters while retaining 99% of zero-shot performance on LLAMA2-70B and OPT-66B, with no custom hardware kernels required.
- 中
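SliceGPT Reading Notes: Deleting Rows and Columns of a Transformer via Computational Invariance
SliceGPT shows that transformer computation is exactly invariant under orthogonal changes of basis. It uses PCA to rotate the weight matrices toward the directions of highest variance and then slices off the low-variance dimensions directly, retaining 99% of zero-shot performance on LLAMA2-70B with a 25% parameter reduction and without any custom CUDA kernels.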
- EN
MegaScale: Engineering 55% MFU at 12,288 GPUs for LLM Training
MegaScale is ByteDance's full-stack production system for training LLMs on more than 10,000 GPUs, achieving 55.2% Model FLOPs Utilization through co-designed algorithmic optimizations, communication overlapping, and deep observability for fault tolerance.
- 中
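MegaScale: How ByteDance Trains LLMs at 55% MFU on 12,288 GPUs
MegaScale is ByteDance's production system for very-large-scale LLM training. Through algorithm-system co-design, communication-computation overlap, operator optimization, and deep observability, it achieves 55.2% Model FLOPs Utilization on 12,288 GPUs, a 1.34× improvement over Megatron-LM.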
- EN
KeepKV: Lossless KV Cache Compression via Electoral Votes and ZIP-Merging
KeepKV introduces Electoral Votes and Zero Inference-Perturbation Merging to achieve single-step lossless KV cache compression, provably fixing the Attention Sag problem that plagues all prior merging methods.
- 中
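KeepKV: Lossless KV Cache Compression with an Electoral Votes Mechanism and Zero-Perturbation Merging
KeepKV proposes the Electoral Votes mechanism and Zero Inference-Perturbation Merging (ZIP-Merging), proves mathematically that single-step KV cache compression can be lossless, and fundamentally resolves the Attention Sag problem present in all existing merging methods.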
- EN
VAPO: Value-Augmented Proximal Policy Optimization for Long-CoT Reasoning
VAPO revives value-model-based RL for LLM reasoning by introducing Length-adaptive GAE and a suite of complementary techniques, reaching 60.4 on AIME 2024 with Qwen2.5-32B — outperforming DAPO by more than 10 points in under 5,000 training steps.
- 中
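VAPO: Value-Augmented Proximal Policy Optimization for Long Chain-of-Thought Reasoning
By introducing Length-adaptive GAE and a set of complementary techniques, VAPO puts value-model-based reinforcement learning back ahead of value-model-free methods, reaching an AIME 2024 score of 60.4 with Qwen2.5-32B in under 5,000 steps, more than 10 points above DAPO.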
- EN
ExpWeaver: How LLM Agents Learn from Past Experience in Latent Space
ExpWeaver replaces text-based experience retrieval with latent-space RAG — encoding past agent trajectories as dense hidden-state vectors and retrieving them at every decoding step via cross-attention, achieving SOTA on 12/13 tasks with 1.5-2x better token efficiency.
- 中
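ExpWeaver: How LLM Agents Learn from Experience in Latent Space
ExpWeaver replaces text retrieval with latent-space RAG: it encodes an agent's past trajectories as dense hidden-state vectors, then retrieves and fuses them via cross-attention at every decoding step, achieving SOTA on 12/13 tasks while cutting token consumption by a factor of 1.5-2.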