Showing posts 25–36 of 200. Each entry opens locally on this site; legacy Hexo posts link back to their original article at the bottom for reference.
2026
- EN
LASER: How Throwing Away 99% of a Weight Matrix Can Make LLMs Smarter
LASER shows the counterintuitive result that selectively replacing weight matrices with heavily truncated SVD approximations — keeping as little as 1% of the rank — can boost LLM reasoning accuracy by up to 27 percentage points without any fine-tuning.
- 中
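LASER: Throw Away 99% of the Matrix Rank, and LLM Reasoning Accuracy Rises by 27%
LASER shows that applying extremely aggressive SVD truncation to the weight matrices of specific Transformer layers, keeping only 1% of the rank, can raise LLM reasoning accuracy by up to 27 percentage points without any fine-tuning.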
- EN
LUMEN: Load-Aware Coordinated Failure Recovery for Distributed LLM Serving
LUMEN treats GPU worker failure recovery in LLM serving as a load-aware coordination problem across three decision points, cutting mean TTFT by 44% and recovery time by 50% over stop-and-restart.
- 中
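LUMEN: Load-Aware Coordinated Failure Recovery for Distributed LLM Inference
LUMEN reframes GPU worker failure recovery in LLM serving clusters as a load-aware coordinated scheduling problem; in Qwen3-32B experiments with four types of workers, it reduces mean TTFT by 44% and shortens recovery time by 50%.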
- EN
OScaR: Occam's Razor for Extreme KV Cache Quantization
OScaR proposes Canalized Rotation and Omni-Token Scaling to fix Token Norm Imbalance in INT2 KV cache quantization, achieving near-lossless accuracy with 5.3× memory reduction and 4.1× throughput gain.
- 中
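OScaR: Occam's Razor for Extreme KV Cache Quantization
OScaR uses Canalized Rotation and Omni-Token Scaling to address Token Norm Imbalance in per-channel quantization, achieving near-lossless accuracy at INT2 with 5.3× memory compression and a 4.1× throughput gain.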
- EN
Back to Basics: Revisiting REINFORCE Style Optimization for RLHF (RLOO)
RLOO shows that PPO is unnecessarily complex for RLHF — a simple REINFORCE Leave-One-Out estimator using k completions per prompt outperforms PPO, DPO, and RAFT with fewer models and no critic network.
- 中
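Back to Basics: Rethinking Policy-Gradient Optimization in RLHF with RLOO
RLOO shows that PPO is overly complex for RLHF: REINFORCE Leave-One-Out, which samples k outputs per prompt and uses the mean reward of the other k-1 as the baseline, outperforms PPO, DPO, and RAFT on every model and dataset tested.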
- EN
Parallel-Synthesis: Direct KV-Cache Synthesis for Parallel Branches in LLM-Agent Workflows
Parallel-Synthesis is a plug-and-play framework that lets a synthesizer LLM directly consume the KV caches produced by parallel worker agents, avoiding redundant prefill and reducing time-to-first-token by 2.5–11× while matching or beating text-concatenation-based synthesis on 7 of 9 benchmarks.
- 中
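Parallel-Synthesis: Letting the LLM Synthesizer Agent Directly Consume the KV Caches of Parallel Branches
Parallel-Synthesis proposes a plug-and-play framework that lets the synthesizer agent's LLM directly reuse the KV caches that parallel workers produce during decoding, eliminating the costly redundant prefill in multi-agent collaboration; across nine benchmarks its accuracy matches or exceeds the text-concatenation baseline, and time-to-first-token drops by 2.5–11×.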
- EN
GF-DiT: Scheduling GPU Parallelism as a First-Class Resource for Diffusion Transformer Serving
GF-DiT treats GPU parallelism as a schedulable resource for Diffusion Transformer serving, decomposing requests into reschedulable trajectory tasks and introducing group-free collectives that cut communication-group setup from 778 ms to 60 μs, achieving up to 6.01× throughput gains and 95% latency reduction over static-parallelism baselines.
- 中
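GF-DiT: A Diffusion Transformer Inference System That Treats GPU Parallelism as a Schedulable Resource
GF-DiT elevates GPU parallelism from a static deployment parameter to a resource schedulable at runtime; using trajectory task graphs and group-free communication primitives (group setup overhead drops from 778 ms to 60 μs), it enables elastic parallel inference, improving throughput by up to 6.01× and reducing mean latency by up to 95%.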