Zhongzhu / Charlie
Tag: #LLM Serving
29 posts tagged with this label.
2026
06-28
EN
Moebius: Seamless Runtime Parallelism Switching for MoE LLM Serving
06-28
中
Moebius: Seamless Runtime Parallelism-Strategy Switching for MoE LLM Inference Serving
06-25
EN
ReMP: Low-Downtime Runtime Model-Parallelism Reconfiguration for LLM Serving
06-25
中
ReMP: Low-Downtime Runtime Parallel-Topology Reconfiguration in LLM Inference Serving
06-21
EN
Tutti: GPU-Centric SSD-Backed KV Cache That Finally Makes SSDs Practical for Long-Context LLM Serving
06-21
中
Tutti Reading Notes: A GPU-Native SSD KV Cache That Makes NVMe SSDs Truly Usable for Long-Context LLM Inference
06-18
EN
LUMEN: Load-Aware Coordinated Failure Recovery for Distributed LLM Serving
06-18
中
LUMEN: Load-Aware Coordinated Failure Recovery for Distributed LLM Inference
06-14
EN
GF-DiT: Scheduling GPU Parallelism as a First-Class Resource for Diffusion Transformer Serving
06-14
中
GF-DiT: A Diffusion Transformer Inference System That Treats GPU Parallelism as a Schedulable Resource
06-07
EN
SlidingServe: SLO-Aware Sliding-Window Scheduling for LLM Inference
06-07
中
SlidingServe: SLO-Aware Sliding-Window Scheduling for LLM Inference
05-28
EN
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
05-28
中
Mooncake: A KV Cache-Centric Disaggregated Architecture for LLM Inference Serving
05-21
EN
SGLang: Efficient Execution of Structured Language Model Programs — Technical Review
05-21
中
SGLang: A Frontend DSL Built for LM Programs + a Co-Designed Runtime (Reading Notes)
05-20
EN
Sarathi-Serve: Taming the Throughput–Latency Tradeoff in LLM Inference — Technical Review
05-20
中
Sarathi-Serve: Taming the Throughput-Latency Tradeoff in LLM Inference with chunked-prefill (Reading Notes)
05-17
EN
PipeSD: Cloud-Edge Collaborative Pipeline Inference with Speculative Decoding — Technical Review
05-17
中
PipeSD: A Cloud-Edge Collaborative Pipeline Inference Framework Based on Speculative Decoding (Reading Notes)
05-16
EN
An Interpretable Latency Model for Speculative Decoding in LLM Serving — Technical Review
05-16
中
Using Little's Law to Explain the Speedup Curve of Speculative Decoding in Real-World Serving (Reading Notes)
05-12
EN
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
05-12
中
DAPO: A Large-Scale Open-Source LLM Reinforcement Learning System
05-10
EN
Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
05-10
中
Tutti: Making SSD-Based KV Cache Truly Suitable for Long-Context LLM Serving
04-09
EN
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Deep Technical Review
04-09
中
DistServe: Goodput-Oriented LLM Serving Optimization via Prefill/Decoding Disaggregation (In-Depth Reading Notes)
02-19
EN
vLLM and PagedAttention: Efficient Memory Management for Large Language Model Serving — Technical Review