Showing posts 61–72 of 200. Each entry opens locally on this site; legacy Hexo posts link back to their original article at the bottom for reference.

2026

05-31 EN

Group Sequence Policy Optimization: A Sequence-Level RL Algorithm for Training Large Language Models

GSPO replaces GRPO's token-level importance ratios with a single sequence-level ratio, yielding more stable and efficient RL for LLMs — and crucially fixing the training collapse that plagues RL on large Mixture-of-Experts models. A from-scratch walkthrough of the math and algorithm, plus a critical look at what the paper leaves untested.
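05-31 ZH

Group Sequence Policy Optimization: An RL Training Method That Corrects GRPO with Sequence-Level Importance Sampling

GSPO swaps GRPO's token-level importance ratios for a single sequence-level ratio, making RL training of LLMs more stable and cheaper, and resolves the RL training collapse seen on large MoE models. This post explains its mathematical motivation and algorithmic details from scratch and critically analyzes the parts the paper has not yet validated.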
05-29 EN

IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression

How double-sided KL-aware whitening, adaptive heterogeneous rank allocation, and loss-aware remapping combine to push SVD-based LLM compression to a new state of the art — with 4.34× decode throughput and minimal quality loss even at 60% parameter removal.
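05-29 ZH

IO-SVD: An Adaptive-Rank LLM Compression Method Based on Input-Output Double-Sided Whitening

KL-divergence-aware double-sided whitening, greedy heterogeneous rank allocation, and loss-aware quantization remapping: three techniques that together push SVD compression to a new SOTA, lowering PPL to 5.59 on LLaMA-7B at an 80% retention ratio while delivering a 4.34× decode throughput gain.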
05-28 EN

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

How Moonshot AI's Kimi serving platform redesigned LLM infrastructure around KV cache disaggregation, achieving throughput gains of up to 525% for long-context workloads while maintaining strict TTFT and TBT SLO compliance.
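05-28 ZH

Mooncake: A KV Cache-Centric Disaggregated Architecture for LLM Inference Serving

How Moonshot AI (Kimi) redesigned its entire LLM serving system around the scheduling, reuse, and transfer of the KV cache, achieving a 525% throughput gain in long-context scenarios while meeting strict TTFT and TBT latency SLOs.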
05-26 EN

SimPO: Simple Preference Optimization with a Reference-Free Reward

SimPO replaces DPO's reference-model-dependent implicit reward with a length-normalized average log probability, eliminates the reference model entirely, and adds a target reward margin to the Bradley-Terry objective. It outperforms DPO by up to 6.4 points on AlpacaEval 2 and up to 7.5 points on Arena-Hard while keeping response length controlled. The Gemma-2-9B-it SimPO model ranked #1 on Chatbot Arena among all <10B models.
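05-26 ZH

SimPO: Simple Preference Optimization Without a Reference Model

SimPO replaces DPO's reference-model-dependent implicit reward with a length-normalized average log probability, removes the reference model entirely, and adds a target reward margin to the Bradley-Terry objective. It ends up beating DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard, without inflating response length. The SimPO model built on Gemma-2-9B-it ranked first among all sub-10B models in Chatbot Arena's real human voting.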
05-25 EN

CodeAct: Executable Code Actions Elicit Better LLM Agents

CodeAct proposes using executable Python code as the single unified action space for LLM agents, replacing fragmented JSON/text tool calls. With control flow, data reuse, existing libraries, and automated error feedback, agents using CodeAct achieve up to 20% higher success rates across 17 LLMs — and fine-tuned CodeActAgent rivals closed-source models on agent benchmarks.
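05-25 ZH

CodeAct: Driving Stronger LLM Agents with Executable Code

CodeAct proposes executable Python code as the unified action space for LLM agents, replacing fragmented JSON/text tool calls. With control flow, variable reuse, existing software libraries, and automated error feedback, CodeAct agents improve success rates by up to 20% across 17 LLMs, and the fine-tuned CodeActAgent 7B model can rival ten-billion-parameter-scale closed-source models on agent benchmarks.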
05-24 EN

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

FlashAttention-2 pushes GPU attention kernels from 25-40% to 50-73% of theoretical A100 peak by fixing three concrete inefficiencies in FA1: unnecessary non-matmul FLOPs, underutilized sequence-length parallelism, and a warp-communication bottleneck. This review unpacks every algorithmic change with full derivations, the GPU execution model, and why each fix matters.
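05-24 ZH

FlashAttention-2: Better Parallelism Strategies and Thread-Block Work Partitioning

FlashAttention-2 raises GPU attention kernel efficiency from 25-40% to 50-73% of the A100's theoretical peak by precisely fixing three specific bottlenecks in FA1: redundant non-matmul FLOPs, insufficient parallelism along the sequence dimension, and a warp communication bottleneck. Starting from GPU hardware principles, this note fully unpacks the mathematical derivation, implementation, and design limits of each change.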