Zhongzhu / Charlie
Tag: #LLM Inference
30 posts tagged with this label. Back to all tags or the main feed.
2026
05-17
EN
PipeSD: Cloud-Edge Collaborative Pipeline Inference with Speculative Decoding — Technical Review
05-17
ZH
PipeSD: A Cloud-Edge Collaborative Pipeline Inference Framework Based on Speculative Decoding (Reading Notes)
05-16
EN
An Interpretable Latency Model for Speculative Decoding in LLM Serving — Technical Review
05-16
ZH
Explaining the Speedup Curve of Speculative Decoding in Real-World Serving with Little's Law (Reading Notes)
05-10
EN
Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
05-10
ZH
Tutti: Making SSD-Based KV Cache Truly Practical for Long-Context LLM Serving
05-09
EN
Queueing Stability for LLM Inference with KV Cache Memory Constraints
04-29
EN
FEPLB Technical Review: Nearly Free MoE Load Balancing with the NVLink Copy Engine
04-24
EN
Generalization at the Edge of Stability: A Random Dynamical Systems Perspective
04-24
EN
FEPLB: Zero-Cost MoE Load Balancing via NVLink Copy Engine
04-22
EN
SAGE: Training-Free Semantic Evidence Composition for Edge-Cloud Inference Under Hard Uplink Budgets
04-19
EN
SpecGuard: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning
04-19
ZH
SpecGuard: Verification-Aware Speculative Decoding for Multi-Step Reasoning
04-17
EN
GRASP Technical Review: Replacing Redundant LLM Layers with Adaptive Singular Parameters
04-15
EN
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding — Deep Technical Review
04-15
ZH
LayerSkip: Making "Early Exit + Self-Verifying Inference" a Deployable Solution for Large Models (In-Depth Reading Notes)
04-10
EN
SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression — Deep Technical Review
04-10
ZH
SVD-LLM: A "Truncation-Aware" Singular Value Decomposition Method for Large Language Model Compression (In-Depth Reading Notes)
04-08
EN
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models — In-Depth Technical Review
04-08
ZH
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (In-Depth Reading Notes)
04-03
EN
AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration — In-Depth Technical Review
04-03
ZH
AWQ: Activation-Aware Weight Quantization for Large Model Compression and Acceleration (In-Depth Reading Notes)
04-01
EN
Layer Pruning for Efficient Large Language Models — In-Depth Technical Review
03-27
EN
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection — In-Depth Technical Review
03-25
EN
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — In-Depth Technical Review
03-21
EN
BitNet: Scaling 1-bit Transformers for Large Language Models — In-Depth Technical Review
03-14
EN
FlashAttention: The IO-Aware Algorithm That Made Transformers Actually Fast
03-11
EN
Speculative Decoding: Making LLM Inference 2-3× Faster Without Losing a Single Token
02-19
EN
vLLM and PagedAttention: Efficient Memory Management for Large Language Model Serving — Technical Review
02-18
EN
DeepSeek-V2: Multi-head Latent Attention and DeepSeekMoE — Technical Review