Showing posts 49–60 of 200. Each entry opens locally on this site; legacy Hexo posts link back to their original article at the bottom for reference.

2026

06-07 EN

SlidingServe: SLO-Aware Sliding-Window Scheduling for LLM Inference

SlidingServe introduces a sliding-window-based scheduler that combines a batch latency predictor, dynamic chunking, multi-level priority sorting, and DP-based batch construction to improve LLM serving capacity by up to 30% while cutting SLO violations by 16–53%.
06-07 中

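SlidingServe: SLO-Aware Sliding-Window Scheduling for LLM Inference

SlidingServe uses a batch latency predictor, dynamic chunking, multi-level priority sorting, and dynamic-programming-based batch construction to raise the serving capacity of online LLM inference by up to 30% while still meeting SLOs, and cuts the SLO violation rate by 16–53%.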
06-06 EN

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-R1 demonstrates that frontier-level mathematical and code reasoning can emerge from pure reinforcement learning without any human-annotated reasoning trajectories, rivaling OpenAI o1 on AIME and Codeforces benchmarks.
06-06 中

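DeepSeek-R1: Using Reinforcement Learning to Elicit Reasoning Ability in Large Language Models

DeepSeek-R1 shows that frontier-level mathematical and code reasoning can emerge from pure reinforcement learning, with no human-annotated reasoning trajectories, performing on par with OpenAI o1 on AIME and Codeforces.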
06-04 EN

Llumnix: Dynamic Scheduling for Large Language Model Serving

Llumnix brings OS-style process rescheduling to LLM serving: it migrates in-flight requests and their KV cache across GPU instances for continuous load balancing, defragmentation, and SLO-aware prioritization. A near-zero-downtime pre-copy migration mechanism and a virtual-usage abstraction make this possible, yielding up to 15x lower P99 TTFT and 36% cost savings on a 16-GPU cluster.
06-04 中

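Llumnix: A Dynamic Scheduling System for Large Language Model Inference Serving

Llumnix brings the idea of operating-system process context switching to LLM inference serving, migrating requests and their KV cache live to achieve continuous load balancing across GPU instances. Its core is a near-zero-downtime pre-copy migration mechanism and a virtual-usage abstraction that unifies five kinds of scheduling goals; on a 16-GPU cluster it lowers P99 TTFT by up to 15x and saves 36% in cost.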
06-03 EN

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization — Technical Review

A deep technical review of KVQuant (NeurIPS 2024), which achieves sub-4-bit KV cache quantization enabling 10M context inference through per-channel key quantization, pre-RoPE quantization, sensitivity-weighted non-uniform datatypes, and per-vector dense-and-sparse outlier handling.
06-03 中

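KVQuant: KV Cache Quantization for Ten-Million-Scale Contexts (Reading Notes)

KVQuant (NeurIPS 2024, UC Berkeley) uses per-channel key quantization, pre-RoPE quantization, Fisher-weighted non-uniform quantization datatypes (nuqX), and per-vector dense-and-sparse residuals to achieve sub-4-bit KV cache quantization, extending context length to the ten-million scale with a 3-bit perplexity degradation of only 0.07.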
06-02 EN

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

REINFORCE++ proves that GRPO's per-prompt advantage normalization is a biased estimator, then fixes it with a single global batch normalization step — achieving state-of-the-art results across general RLHF, complex reasoning, and long-horizon agentic tasks, all without a critic network.
06-02 中

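REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

REINFORCE++ proves mathematically that GRPO's per-prompt local normalization is a biased estimator and replaces it with global batch normalization, outperforming GRPO and PPO across general RLHF, complex reasoning, and long-horizon agent tasks, all without any critic network.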
06-01 EN

AutoSci: A Memory-Centric Agentic System for the Full Scientific Research Lifecycle

AutoSci is a memory-centric agentic system that automates the full scientific research lifecycle — reading, ideation, experimentation, writing, and rebuttal — through four integrated modules (SciMem, SciFlow, SciDAG, SciEvolve). I walk through the architecture from scratch and critically assess its evaluation methodology and narrow domain coverage.
06-01 中

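AutoSci: A Memory-Centric Autonomous Agent System for the Full Scientific Research Lifecycle

AutoSci is a "permanent research environment" proposed by a Peking University team. It uses memory-centric multi-agent collaboration to chain reading literature, generating ideas, running experiments, writing papers, and responding to reviewers into a self-evolving closed loop. This post walks through its four modules (SciMem / SciFlow / SciDAG / SciEvolve) and two end-to-end case studies from scratch, and critically analyzes the limitations of its evaluation methodology and scope of applicability.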