Showing posts 61–72 of 200. Each entry opens locally on this site; legacy Hexo posts link back to their original article at the bottom for reference.
2026
- EN
Group Sequence Policy Optimization: A Sequence-Level RL Algorithm for Training Large Language Models
GSPO replaces GRPO's token-level importance ratios with a single sequence-level ratio, yielding more stable and efficient RL for LLMs — and crucially fixing the training collapse that plagues RL on large Mixture-of-Experts models. A from-scratch walkthrough of the math and algorithm, plus a critical look at what the paper leaves untested.
- 中
Group Sequence Policy Optimization:序列级重要性采样修正 GRPO 的 RL 训练方法
GSPO 把 GRPO 的 token 级重要性比率换成单一的序列级比率,让 LLM 的强化学习训练更稳、更省,并解决了大型 MoE 模型上 RL 训练崩溃的难题。本文从零讲清它的数学动机与算法细节,并批判性地分析了论文尚未验证的部分。
- EN
IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression
How double-sided KL-aware whitening, adaptive heterogeneous rank allocation, and loss-aware remapping combine to push SVD-based LLM compression to a new state of the art — with 4.34× decode throughput and minimal quality loss even at 60% parameter removal.
- 中
IO-SVD:基于输入输出双侧白化的自适应秩LLM压缩方法
KL散度感知的双侧白化 + 贪婪异构秩分配 + 损失感知量化重映射,三招组合将SVD压缩推到新的SOTA——在LLaMA-7B 80%保留率下PPL降至5.59,同时带来4.34倍解码吞吐提升。
- EN
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
How Moonshot AI's Kimi serving platform redesigned LLM infrastructure around KV cache disaggregation, achieving up to 525% higher throughput for long-context workloads while still meeting strict TTFT and TBT SLOs.
- 中
Mooncake:以 KV Cache 为核心的大模型推理服务解耦架构
Moonshot AI(Kimi)如何将整个 LLM 服务系统围绕 KV Cache 的调度、复用与迁移重新设计:在长文本场景下实现最高 525% 的吞吐量提升,同时满足严格的 TTFT 和 TBT 延迟 SLO。
- EN
SimPO: Simple Preference Optimization with a Reference-Free Reward
SimPO replaces DPO's reference-model-dependent implicit reward with a length-normalized average log probability, eliminates the reference model entirely, adds a target reward margin to the Bradley-Terry objective, and achieves up to +6.4 points on AlpacaEval 2 and +7.5 on Arena-Hard — all while keeping response length controlled. The Gemma-2-9B-it SimPO model ranked #1 on Chatbot Arena among all <10B models.
- 中
SimPO:无需参考模型的简洁偏好优化
SimPO 将 DPO 依赖参考模型的隐式奖励,替换为长度归一化的平均对数概率,彻底移除参考模型,并在 Bradley-Terry 目标中加入目标奖励边距。最终在 AlpacaEval 2 上超越 DPO 最高 +6.4 分、在 Arena-Hard 上超越最高 +7.5 分,且不引入回答长度膨胀。基于 Gemma-2-9B-it 的 SimPO 模型在 Chatbot Arena 人类真实投票中排名全部 10B 以下模型第一。
- EN
CodeAct: Executable Code Actions Elicit Better LLM Agents
CodeAct proposes using executable Python code as the single unified action space for LLM agents, replacing fragmented JSON/text tool calls. With control flow, data reuse, existing libraries, and automated error feedback, agents using CodeAct achieve up to 20% higher success rates across 17 LLMs — and fine-tuned CodeActAgent rivals closed-source models on agent benchmarks.
- 中
CodeAct:用可执行代码驱动更强的 LLM Agent
CodeAct 提出用可执行 Python 代码作为 LLM Agent 的统一动作空间,取代碎片化的 JSON/文本工具调用。借助控制流、变量复用、现有软件库和自动错误反馈,CodeAct Agent 在 17 个大模型上的成功率提升最高达 20%——而微调后的 CodeActAgent 7B 模型在 Agent 基准上可比肩百亿规模闭源模型。
- EN
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
FlashAttention-2 pushes GPU attention kernels from 25-40% to 50-73% of theoretical A100 peak by fixing three concrete inefficiencies in FA1: unnecessary non-matmul FLOPs, underutilized sequence-length parallelism, and a warp-communication bottleneck. This review unpacks every algorithmic change with full derivations, the GPU execution model, and why each fix matters.
- 中
FlashAttention-2:更好的并行策略与线程块工作划分
FlashAttention-2 把 GPU 注意力核的效率从 A100 理论峰值的 25-40% 提升到 50-73%,靠的是精准修复 FA1 的三个具体瓶颈:多余的非矩阵乘 FLOP、序列维度并行度不足,以及 warp 间通信瓶颈。这篇笔记从 GPU 硬件原理出发,完整拆解每一项改动的数学推导、实现原理和设计边界。