Zhongzhu / Charlie
Tag: #LLM Inference
72 posts tagged with this label.
2026
07-01
EN
SSV: Sparse Speculative Verification for Efficient LLM Inference
07-01
ZH
SSV: Sparse Speculative Verification for Speculative Decoding under Dynamic Sparse Attention
06-29
EN
ACTS: Steering How LLMs Reason, Not Just How Long
06-29
ZH
ACTS: An RL-Trained Controller That Makes LLM Reasoning Smarter, Not Just Shorter
06-28
EN
Moebius: Seamless Runtime Parallelism Switching for MoE LLM Serving
06-28
ZH
Moebius: Seamless Runtime Parallelism-Strategy Switching for MoE LLM Inference Serving
06-27
EN
JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting
06-27
ZH
JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting
06-26
EN
SigmaScale: Learning to Scale Weight Matrices for Better SVD-Based LLM Compression
06-26
ZH
SigmaScale Reading Notes: Improving SVD-Based LLM Compression by Learning Scaling Matrices
06-25
EN
ReMP: Low-Downtime Runtime Model-Parallelism Reconfiguration for LLM Serving
06-25
ZH
ReMP: Low-Downtime Runtime Parallel-Topology Reconfiguration for LLM Inference Serving
06-18
EN
LUMEN: Load-Aware Coordinated Failure Recovery for Distributed LLM Serving
06-18
ZH
LUMEN: Load-Aware Coordinated Failure Recovery for Distributed LLM Inference
06-17
EN
OScaR: Occam's Razor for Extreme KV Cache Quantization
06-17
ZH
OScaR: Occam's Razor for Extreme KV Cache Quantization
06-15
EN
Parallel-Synthesis: Direct KV-Cache Synthesis for Parallel Branches in LLM-Agent Workflows
06-15
ZH
Parallel-Synthesis: Letting the LLM Synthesis Agent Directly Consume the KV Caches of Parallel Branches
06-14
EN
GF-DiT: Scheduling GPU Parallelism as a First-Class Resource for Diffusion Transformer Serving
06-14
ZH
GF-DiT: A Diffusion Transformer Inference System That Treats GPU Parallelism as a Schedulable Resource
06-12
EN
SliceGPT: Post-Training LLM Compression via Computational Invariance
06-12
ZH
SliceGPT Reading Notes: Deleting Transformer Rows and Columns via Computational Invariance
06-10
EN
KeepKV: Lossless KV Cache Compression via Electoral Votes and ZIP-Merging
06-10
ZH
KeepKV: Lossless KV Cache Compression via an "Electoral Votes" Mechanism and Zero-Perturbation Merging
06-08
EN
ExpWeaver: How LLM Agents Learn from Past Experience in Latent Space
06-08
ZH
ExpWeaver: How LLM Agents Learn from Experience in Latent Space
06-07
EN
SlidingServe: SLO-Aware Sliding-Window Scheduling for LLM Inference
06-07
ZH
SlidingServe: SLO-Aware Sliding-Window Scheduling for LLM Inference
06-04
EN
Llumnix: Dynamic Scheduling for Large Language Model Serving
06-04
ZH
Llumnix: A Dynamic Scheduling System for LLM Inference Serving
06-03
EN
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization — Technical Review
06-03
ZH
KVQuant: KV Cache Quantization for Ten-Million-Token Contexts (Reading Notes)
05-29
EN
IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression
05-29
ZH
IO-SVD: Adaptive-Rank LLM Compression Based on Two-Sided Input-Output Whitening
05-28
EN
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
05-28
ZH
Mooncake: A KV Cache-Centric Disaggregated Architecture for LLM Inference Serving
05-24
EN
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
05-24
ZH
FlashAttention-2: Better Parallelism and Thread-Block Work Partitioning
05-21
EN
SGLang: Efficient Execution of Structured Language Model Programs — Technical Review
05-21
ZH
SGLang: A Front-End DSL and Co-Designed Runtime Built for LM Programs (Reading Notes)
05-20
EN
Sarathi-Serve: Taming the Throughput–Latency Tradeoff in LLM Inference — Technical Review
05-20
ZH
Sarathi-Serve: Taming the Throughput-Latency Tradeoff in LLM Inference with chunked-prefill (Reading Notes)
05-17
EN
PipeSD: Cloud-Edge Collaborative Pipeline Inference with Speculative Decoding — Technical Review
05-17
ZH
PipeSD: A Cloud-Edge Collaborative Pipeline Inference Framework Based on Speculative Decoding (Reading Notes)
05-16
EN
An Interpretable Latency Model for Speculative Decoding in LLM Serving — Technical Review
05-16
ZH
Explaining the Speedup Curve of Speculative Decoding in Real Serving with Little's Law (Reading Notes)
05-10
EN
Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
05-10
ZH
Tutti: Making SSD-Backed KV Cache Truly Practical for Long-Context LLM Serving
05-09
EN
Queueing Stability for LLM Inference with KV Cache Memory Constraints
04-29
EN
FEPLB Technical Review: Nearly Free MoE Load Balancing with the NVLink Copy Engine
04-24
EN
Generalization at the Edge of Stability: A Random Dynamical Systems Perspective
04-24
EN
FEPLB: Zero-Cost MoE Load Balancing via NVLink Copy Engine
04-22
EN
SAGE: Training-Free Semantic Evidence Composition for Edge-Cloud Inference Under Hard Uplink Budgets
04-19
EN
SpecGuard: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning
04-19
ZH
SpecGuard: Verification-Aware Speculative Decoding for Multi-Step Reasoning
04-17
EN
GRASP Technical Review: Replacing Redundant LLM Layers with Adaptive Singular Parameters
04-15
EN
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding — Deep Technical Review
04-15
ZH
LayerSkip: Making "Early Exit + Self-Verified Inference" a Deployable Option for Large Models (In-Depth Reading Notes)
04-10
EN
SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression — Deep Technical Review
04-10
ZH
SVD-LLM: A "Truncation-Aware" Singular Value Decomposition Method for Large Language Model Compression (In-Depth Reading Notes)
04-08
EN
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models — In-Depth Technical Review
04-08
ZH
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (In-Depth Reading Notes)
04-03
EN
AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration — In-Depth Technical Review
04-03
ZH
AWQ: Activation-Aware Weight Quantization for Large Model Compression and Acceleration (In-Depth Reading Notes)
04-01
EN
Layer Pruning for Efficient Large Language Models — In-Depth Technical Review
03-27
EN
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection — In-Depth Technical Review
03-25
EN
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — In-Depth Technical Review
03-21
EN
BitNet: Scaling 1-bit Transformers for Large Language Models — In-Depth Technical Review
03-14
EN
FlashAttention: The IO-Aware Algorithm That Made Transformers Actually Fast
03-11
EN
Speculative Decoding: Making LLM Inference 2-3× Faster Without Losing a Single Token
02-19
EN
vLLM and PagedAttention: Efficient Memory Management for Large Language Model Serving — Technical Review
02-18
EN
DeepSeek-V2: Multi-head Latent Attention and DeepSeekMoE — Technical Review