Tag

#LLM Serving

29 posts tagged with this label.

2026

06-28 EN

Moebius: Seamless Runtime Parallelism Switching for MoE LLM Serving
06-28 ZH

Moebius: Seamless Runtime Parallelism-Strategy Switching for MoE LLM Inference Serving
06-25 EN

ReMP: Low-Downtime Runtime Model-Parallelism Reconfiguration for LLM Serving
06-25 ZH

ReMP: Low-Downtime Runtime Parallel-Topology Reconfiguration in LLM Inference Serving
06-21 EN

Tutti: GPU-Centric SSD-Backed KV Cache That Finally Makes SSDs Practical for Long-Context LLM Serving
06-21 ZH

Tutti Reading Notes: A GPU-Native SSD KV Cache That Makes NVMe SSDs Truly Usable for Long-Context LLM Inference
06-18 EN

LUMEN: Load-Aware Coordinated Failure Recovery for Distributed LLM Serving
06-18 ZH

LUMEN: Load-Aware Coordinated Failure Recovery for Distributed LLM Inference
06-14 EN

GF-DiT: Scheduling GPU Parallelism as a First-Class Resource for Diffusion Transformer Serving
06-14 ZH

GF-DiT: A Diffusion Transformer Inference System That Treats GPU Parallelism as a Schedulable Resource
06-07 EN

SlidingServe: SLO-Aware Sliding-Window Scheduling for LLM Inference
06-07 ZH

SlidingServe: SLO-Aware Sliding-Window Scheduling for LLM Inference
05-28 EN

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
05-28 ZH

Mooncake: A KV Cache-Centric Disaggregated Architecture for LLM Inference Serving
05-21 EN

SGLang: Efficient Execution of Structured Language Model Programs — Technical Review
05-21 ZH

SGLang: A Frontend DSL and Co-Designed Runtime Built for LM Programs (Reading Notes)
05-20 EN

Sarathi-Serve: Taming the Throughput–Latency Tradeoff in LLM Inference — Technical Review
05-20 ZH

Sarathi-Serve: Taming the Throughput-Latency Tradeoff in LLM Inference with Chunked Prefill (Reading Notes)
05-17 EN

PipeSD: Cloud-Edge Collaborative Pipeline Inference with Speculative Decoding — Technical Review
05-17 ZH

PipeSD: A Cloud-Edge Collaborative Pipeline Inference Framework Based on Speculative Decoding (Reading Notes)
05-16 EN

An Interpretable Latency Model for Speculative Decoding in LLM Serving — Technical Review
05-16 ZH

Explaining the Speedup Curve of Speculative Decoding in Real-World Serving with Little's Law (Reading Notes)
05-12 EN

DAPO: An Open-Source LLM Reinforcement Learning System at Scale
05-12 ZH

DAPO: An Open-Source LLM Reinforcement Learning System at Scale
05-10 EN

Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
05-10 ZH

Tutti: Making SSD-Backed KV Cache Truly Practical for Long-Context LLM Serving
04-09 EN

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Deep Technical Review
04-09 ZH

DistServe: Goodput-Oriented LLM Serving Optimization via Prefill/Decoding Disaggregation (In-Depth Reading Notes)
02-19 EN

vLLM and PagedAttention: Efficient Memory Management for Large Language Model Serving — Technical Review