Tag

#LLM Inference

72 posts tagged with this label.

2026

07-01 EN

SSV: Sparse Speculative Verification for Efficient LLM Inference
07-01 中

SSV：稀疏投机验证——在动态稀疏注意力中做投机解码
06-29 EN

ACTS: Steering How LLMs Reason, Not Just How Long
06-29 中

ACTS：用强化学习训练的控制器，让 LLM 推理更聪明而不只是更短
06-28 EN

Moebius: Seamless Runtime Parallelism Switching for MoE LLM Serving
06-28 中

Moebius：为 MoE 大模型推理服务实现无缝运行时并行策略切换
06-27 EN

JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting
06-27 中

JetSpec：用并行树草稿突破推测解码的扩展上限
06-26 EN

SigmaScale: Learning to Scale Weight Matrices for Better SVD-Based LLM Compression
06-26 中

SigmaScale 阅读笔记：通过学习缩放矩阵改进 SVD 大语言模型压缩
06-25 EN

ReMP: Low-Downtime Runtime Model-Parallelism Reconfiguration for LLM Serving
06-25 中

ReMP：LLM 推理服务中的低停机运行时并行拓扑重配置
06-18 EN

LUMEN: Load-Aware Coordinated Failure Recovery for Distributed LLM Serving
06-18 中

LUMEN：面向分布式大模型推理的负载感知协同故障恢复
06-17 EN

OScaR: Occam's Razor for Extreme KV Cache Quantization
06-17 中

OScaR：极端 KV 缓存量化的奥卡姆剃刀
06-15 EN

Parallel-Synthesis: Direct KV-Cache Synthesis for Parallel Branches in LLM-Agent Workflows
06-15 中

Parallel-Synthesis：让 LLM 综合智能体直接消费并行分支的 KV 缓存
06-14 EN

GF-DiT: Scheduling GPU Parallelism as a First-Class Resource for Diffusion Transformer Serving
06-14 中

GF-DiT：把 GPU 并行度当作可调度资源的扩散 Transformer 推理系统
06-12 EN

SliceGPT: Post-Training LLM Compression via Computational Invariance
06-12 中

SliceGPT 阅读笔记：用计算不变性删除 Transformer 的行与列
06-10 EN

KeepKV: Lossless KV Cache Compression via Electoral Votes and ZIP-Merging
06-10 中

KeepKV：用「选举票」机制和零扰动合并实现无损 KV 缓存压缩
06-08 EN

ExpWeaver: How LLM Agents Learn from Past Experience in Latent Space
06-08 中

ExpWeaver：LLM 智能体如何在隐空间中从经验中学习
06-07 EN

SlidingServe: SLO-Aware Sliding-Window Scheduling for LLM Inference
06-07 中

SlidingServe：面向 LLM 推理的 SLO 感知滑动窗口调度
06-04 EN

Llumnix: Dynamic Scheduling for Large Language Model Serving
06-04 中

Llumnix：大语言模型推理服务的动态调度系统
06-03 EN

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization — Technical Review
06-03 中

KVQuant：面向千万级上下文的 KV 缓存量化技术——阅读笔记
05-29 EN

IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression
05-29 中

IO-SVD：基于输入输出双侧白化的自适应秩 LLM 压缩方法
05-28 EN

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
05-28 中

Mooncake：以 KV Cache 为核心的大模型推理服务解耦架构
05-24 EN

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
05-24 中

FlashAttention-2：更好的并行策略与线程块工作划分
05-21 EN

SGLang: Efficient Execution of Structured Language Model Programs — Technical Review
05-21 中

SGLang:为 LM 程序而生的前端 DSL + 协同设计运行时 —— 阅读笔记
05-20 EN

Sarathi-Serve: Taming the Throughput–Latency Tradeoff in LLM Inference — Technical Review
05-20 中

Sarathi-Serve:用 chunked-prefill 驯服 LLM 推理的吞吐-延迟权衡 —— 阅读笔记
05-17 EN

PipeSD: Cloud-Edge Collaborative Pipeline Inference with Speculative Decoding — Technical Review
05-17 中

PipeSD：基于推测解码的云边协同流水线推理框架 —— 阅读笔记
05-16 EN

An Interpretable Latency Model for Speculative Decoding in LLM Serving — Technical Review
05-16 中

用 Little 定律解释推测解码在真实服务中的提速曲线 —— 阅读笔记
05-10 EN

Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
05-10 中

Tutti：让基于 SSD 的 KV Cache 真正适用于长上下文 LLM Serving
05-09 EN

Queueing Stability for LLM Inference with KV Cache Memory Constraints
04-29 EN

FEPLB Technical Review: Nearly Free MoE Load Balancing with the NVLink Copy Engine
04-24 EN

Generalization at the Edge of Stability: A Random Dynamical Systems Perspective
04-24 EN

FEPLB: Zero-Cost MoE Load Balancing via NVLink Copy Engine
04-22 EN

SAGE: Training-Free Semantic Evidence Composition for Edge-Cloud Inference Under Hard Uplink Budgets
04-19 EN

SpecGuard: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning
04-19 中

SpecGuard：用于多步推理的验证感知推测解码
04-17 EN

GRASP Technical Review: Replacing Redundant LLM Layers with Adaptive Singular Parameters
04-15 EN

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding — Deep Technical Review
04-15 中

LayerSkip：让大模型“提前退出 + 自校验推理”成为可部署方案——深度阅读笔记
04-10 EN

SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression — Deep Technical Review
04-10 中

SVD-LLM：面向大语言模型压缩的“截断感知”奇异值分解方法 — 深度阅读笔记
04-08 EN

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models — In-Depth Technical Review
04-08 中

SmoothQuant：大型语言模型的精准高效训练后量化 — 深度阅读笔记
04-03 EN

AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration — In-Depth Technical Review
04-03 中

AWQ：感知激活值的大模型权重量化压缩与加速 — 深度阅读笔记
04-01 EN

Layer Pruning for Efficient Large Language Models — In-Depth Technical Review
03-27 EN

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection — In-Depth Technical Review
03-25 EN

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — In-Depth Technical Review
03-21 EN

BitNet: Scaling 1-bit Transformers for Large Language Models — In-Depth Technical Review
03-14 EN

FlashAttention: The IO-Aware Algorithm That Made Transformers Actually Fast
03-11 EN

Speculative Decoding: Making LLM Inference 2-3× Faster Without Losing a Single Token
02-19 EN

vLLM and PagedAttention: Efficient Memory Management for Large Language Model Serving — Technical Review
02-18 EN

DeepSeek-V2: Multi-head Latent Attention and DeepSeekMoE — Technical Review