200 posts in total. Keep on posting.

Showing posts 73–84 of 200. Each entry opens locally on this site; legacy Hexo posts link back to their original article at the bottom for reference.

2026

  • EN

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-R1 shows that reasoning in large language models can emerge from pure reinforcement learning — without any human-annotated reasoning traces. This review unpacks GRPO, the multi-stage training pipeline, distillation to smaller models, and why this paper changed how the field thinks about post-training.

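  • DeepSeek-R1: Using Reinforcement Learning to Elicit Reasoning in Large Language Models

    DeepSeek-R1 demonstrates that reasoning ability in large language models can emerge from pure reinforcement learning, with no human-annotated reasoning traces required. This post breaks down the GRPO algorithm, the four-stage training pipeline, the distillation method, and the design trade-offs behind them.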

  • EN

    DoRA: Weight-Decomposed Low-Rank Adaptation — Technical Review

    DoRA (ICML 2024 Oral) decomposes pretrained weights into magnitude and direction components, then applies LoRA only to the directional part. This decomposition mirrors how full fine-tuning actually updates weights, closing the accuracy gap between LoRA and full FT without adding any inference overhead.

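  • DoRA: Weight-Decomposed Low-Rank Adaptation, Decoupling Magnitude and Direction to Improve LoRA's Learning Capacity | Reading Notes

    DoRA (ICML 2024 Oral) decomposes pretrained weights into two parts, a "magnitude vector" and a "direction matrix", then applies LoRA to the direction part. The decomposition gives a mathematical explanation for the accuracy gap between LoRA and full fine-tuning, and closes that gap while adding almost no parameters.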

  • EN

    SGLang: Efficient Execution of Structured Language Model Programs — Technical Review

    A technical review of SGLang (NeurIPS 2024), the paper that turned a domain-specific frontend language for multi-call LLM programs into a co-designed runtime. The core trick is RadixAttention: treat the KV cache as an LRU radix tree so prefix sharing happens automatically across calls, instances, and tensor-parallel ranks. SGLang adds a compressed-FSM decoder for regex-constrained outputs and API speculative execution for black-box endpoints. The paper reports up to 6.4× higher throughput and up to 3.7× lower latency than vLLM, Guidance, and LMQL on agent control, tree-of-thought, JSON decoding, RAG, and multi-modal workloads.

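  • SGLang: A Frontend DSL and Co-Designed Runtime Built for LM Programs | Reading Notes

    Reading notes on SGLang (NeurIPS 2024). It ties a Python-embedded DSL for multi-call LM programs to a co-designed runtime. Its core technique, RadixAttention, organizes the KV cache as a radix tree with LRU eviction so that prefix reuse happens automatically across calls, across instances, and across TP ranks. Combined with compressed-FSM decoding and API speculative execution, it achieves up to 6.4× higher throughput and up to 3.7× lower latency than vLLM / Guidance / LMQL on Llama-7B. The notes spend more than half their length on background (LM programs, KV cache, PagedAttention, radix trees, LMQL/Guidance, continuous batching, constrained decoding) before covering the algorithm, then interpret the experiments and situate the work in the literature.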

  • EN

    Sarathi-Serve: Taming the Throughput–Latency Tradeoff in LLM Inference — Technical Review

    A technical review of Sarathi-Serve (OSDI 2024). The paper argues that prefill-prioritizing schedulers (vLLM, Orca) trade tail latency for throughput, and decode-prioritizing ones (FasterTransformer) trade throughput for latency. Sarathi-Serve breaks the tradeoff by splitting prefills into fixed-size chunks, fusing them with ongoing decodes, and picking a token budget that hits the linear-layer compute-bound knee while staying inside a TBT SLO. The paper reports capacity gains of up to 2.6×, 3.7×, and 5.6× on Mistral-7B, Yi-34B, and Falcon-180B, respectively. The review front-loads the prerequisites (prefill vs decode, arithmetic intensity, TBT tails, PP pipeline bubbles) before the algorithm and evaluation.

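  • Sarathi-Serve: Taming the Throughput-Latency Tradeoff in LLM Inference with Chunked Prefill | Reading Notes

    Reading notes on Sarathi-Serve (OSDI 2024). Chunked prefill splits each prefill into equal-sized chunks, and stall-free batching packs prefill chunks and in-flight decode requests into the same iteration, thoroughly rewriting the LLM inference scheduler: 2.6× capacity on Mistral-7B, 3.7× on Yi-34B, and 5.6× on Falcon-180B (PP). The "chunked prefill" used today in vLLM / TensorRT-LLM / SGLang comes from this paper. The notes spend more than half their length on prerequisites (prefill vs decode, arithmetic intensity, TBT tail latency, PP pipeline bubbles) before covering the algorithm and experiments.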

  • EN

    KTO: Model Alignment as Prospect Theoretic Optimization — Technical Blog Review

    Technical review of KTO (Ethayarajh et al., Stanford / Contextual AI, ICML 2024, arXiv:2402.01306): reframes DPO and PPO-Clip through Kahneman-Tversky prospect theory as a family of Human-Aware Losses (HALOs), then derives Kahneman-Tversky Optimization — an alignment objective that needs only a binary desirable/undesirable signal per response, no preference pairs. KTO matches or exceeds DPO across Pythia-1.4B to Llama-30B (GSM8K +13.5 pts on Zephyr-β-SFT/UltraFeedback) and stays robust under 1:10 class imbalance via λD / λU reweighting.

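  • KTO: Viewing Model Alignment as "Prospect Theory" Optimization | Reading Notes

    Reading notes on KTO: it places DPO and PPO-Clip within the Kahneman-Tversky prospect theory framework, unifies them as Human-Aware Losses (HALOs), and then derives Kahneman-Tversky Optimization, which needs only a binary desirable/undesirable signal. KTO matches or exceeds DPO at every scale from Pythia-1.4B to Llama-30B (GSM8K +13.5 pts on Zephyr-β-SFT + UltraFeedback) and remains robust under 1:10 class imbalance. Stanford / Contextual AI, ICML 2024, arXiv 2402.01306.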

  • EN

    Why Single-Agent LLMs Beat Multi-Agent Systems on Multi-Hop Reasoning — A Budget-Controlled Story

    Technical review of Tran & Kiela (Stanford, arXiv 2604.02460): once you fix the thinking-token budget as the sole resource axis, single-agent LLMs (SAS) match or beat every multi-agent architecture (Sequential / Subtask-parallel / Parallel-roles / Debate / Ensemble) across a 336-configuration matrix (Qwen3-30B-A3B, DeepSeek-R1-Distill-Llama-70B, Gemini-2.5-Flash/Pro × FRAMES + MuSiQue 4-hop × 100–10000 tokens). The paper grounds this in a clean Data Processing Inequality argument, identifies the regime flip under heavy context degradation (substitution/masking at α=0.7), and audits the Gemini 2.5 thinking_budget API artifact that motivates the SAS-L scaffold.

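  • Once the Thinking Budget Is Fixed, Why a Single Agent Beats Multiple Agents | Reading Notes

    Reading notes on Tran & Kiela (Stanford, arXiv 2604.02460): with the "thinking-token budget" as the sole resource axis, a single agent (SAS) matches or beats the strongest multi-agent system (Sequential / Subtask-parallel / Parallel-roles / Debate / Ensemble) almost everywhere across 336 configurations spanning Qwen3-30B-A3B / DeepSeek-R1-Distill-70B / Gemini-2.5-Flash/Pro × FRAMES + MuSiQue 4-hop × budgets of 100–10000. The paper gives a Bayesian argument based on the Data Processing Inequality, describes the phase change under context degradation where the DPI conclusion reverses, and audits the Gemini 2.5 thinking_budget API metering artifact (the origin of the SAS-L prefix).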
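
To make the DoRA entry above concrete, here is a minimal pure-Python sketch of the magnitude/direction reparameterization it describes: the adapted weight is a learned per-column magnitude times the column-normalized sum of the frozen weight and a low-rank LoRA update. The helper names (`column_norms`, `matmul`, `dora_weight`) and the tiny 2×2 example are illustrative assumptions, not code from the paper or from any library.

```python
import math

def column_norms(W):
    """L2 norm of each column of a matrix stored as a list of rows."""
    return [math.sqrt(sum(row[j] ** 2 for row in W)) for j in range(len(W[0]))]

def matmul(X, Y):
    """Plain matrix product for small list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def dora_weight(W0, B, A, m):
    """Sketch of W' = m * (W0 + B @ A) / ||W0 + B @ A||, with column-wise norms.

    W0 is the frozen pretrained weight, B @ A is the low-rank (LoRA) update
    applied to the direction, and m is the trainable magnitude vector.
    """
    delta = matmul(B, A)
    V = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W0, delta)]
    norms = column_norms(V)
    return [[m[j] * V[i][j] / norms[j] for j in range(len(m))] for i in range(len(V))]

# Hypothetical toy example: rank-1 update on a 2x2 weight.
W0 = [[3.0, 0.0], [4.0, 2.0]]
m = column_norms(W0)            # magnitude initialized from the pretrained weight
B_zero = [[0.0], [0.0]]         # zero-initialized B, so the update starts at zero
A = [[0.5, 0.5]]
W_init = dora_weight(W0, B_zero, A, m)
```

With `B` at zero and `m` initialized to the column norms of `W0`, the sketch reproduces `W0`, so training would start from the pretrained model. Whatever the low-rank update does to the direction, the column norms of the result stay equal to `m`, which is the decoupling the entry refers to.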
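
The prefix sharing behind RadixAttention in the SGLang entry above can be illustrated with a toy prefix cache. This sketch uses a plain per-token trie rather than a compressed radix tree, and it omits LRU eviction and the actual KV tensors. The `PrefixCache` class and its method name are hypothetical and are not SGLang's API.

```python
class PrefixCache:
    """Toy token-level trie: counts how many leading tokens of a request
    were already cached by earlier requests, then caches the rest."""

    def __init__(self):
        self.root = {}

    def match_and_insert(self, tokens):
        node = self.root
        reused = 0
        matching = True
        for t in tokens:
            if matching and t in node:
                reused += 1          # this token's KV entry could be reused
            else:
                matching = False     # first miss: the rest must be computed
            node = node.setdefault(t, {})
        return reused

# Hypothetical token ids: two calls sharing a three-token prompt prefix.
cache = PrefixCache()
first = cache.match_and_insert([1, 2, 3, 4])
second = cache.match_and_insert([1, 2, 3, 9])
```

Here the second call finds its first three tokens already in the cache and would only need to compute the final one. A real runtime would store KV-cache blocks at the nodes and evict least-recently-used leaves under memory pressure.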
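
The scheduling idea in the Sarathi-Serve entry above (chunk the prefill, then pack chunks alongside ongoing decodes under a per-iteration token budget) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's scheduler: `Request`, `build_batch`, and the rule that the chunk size is whatever budget remains after decodes are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    prompt_len: int
    prefilled: int = 0   # prompt tokens already processed

    @property
    def in_decode(self):
        return self.prefilled >= self.prompt_len

def build_batch(requests, token_budget):
    """Build one iteration: every decoding request gets one token first,
    then the leftover budget is filled with prefill chunks."""
    batch = []
    budget = token_budget
    for r in requests:
        if r.in_decode and budget > 0:
            batch.append((r.name, "decode", 1))
            budget -= 1
    for r in requests:
        if budget == 0:
            break
        if not r.in_decode:
            chunk = min(budget, r.prompt_len - r.prefilled)
            batch.append((r.name, "prefill", chunk))
            r.prefilled += chunk
            budget -= chunk
    return batch

# Hypothetical workload: "a" is already decoding, "b" has a 10-token prompt.
demo = [Request("a", 4, 4), Request("b", 10)]
iterations = [build_batch(demo, token_budget=5) for _ in range(4)]
```

In this toy run, request "a" emits a token in every iteration while "b" is prefilled in chunks of 4, 4, and 2 tokens, so the long prompt never stalls the ongoing decode. The token budget is the knob that trades prefill progress against per-iteration latency.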