116 posts in total. Keep on posting.
Showing posts 13–24 of 116. Each entry opens locally on this site; legacy Hexo posts link back to their original article at the bottom for reference.
2026
- ZH
Tutti: Making SSD-Based KV Cache Truly Practical for Long-Context LLM Serving
Chinese-language reading notes on Tutti: building on a GPU-native KV cache object store, GPU io_uring, and slack-aware scheduling, it makes SSD-backed KV cache better suited to long-context LLM serving.
- EN
Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
A detailed technical review of Tutti, a GPU-centric SSD-backed KV cache system that makes long-context LLM serving cache reuse practical.
- EN
Queueing Stability for LLM Inference with KV Cache Memory Constraints
A detailed technical review of a queueing-theoretic framework for predicting LLM inference stability under KV cache memory constraints.
- EN
Swift-SVD: Activation-Aware Low-Rank Compression for LLM Weights and KV Cache
A detailed technical review of Swift-SVD, an activation-aware low-rank compression method for LLM weights and KV cache that uses output covariance eigendecomposition to avoid expensive generalized SVD.
- EN
Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
A detailed technical review of Piper, a resource-model-driven system for large-scale MoE training with pipelined hybrid parallelism, HALO hierarchical all-to-all, and topology-aware expert placement.
- EN
Low-Rank Optimization Trajectories for LLM RLVR Acceleration: A Technical Review of NExt
A detailed technical review of NExt, a method that models low-rank optimization trajectories to accelerate reinforcement learning with verifiable rewards for large language models.
- EN
FEPLB Technical Review: Nearly Free MoE Load Balancing with the NVLink Copy Engine
A detailed technical review of FEPLB, a system that uses Hopper NVLink Copy Engines to perform fine-grained MoE load balancing with little interference to normal expert-parallel training.
- EN
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond — Technical Review
A technical review of agentic world modeling, covering capability levels, governing-law regimes, evaluation, and why decision-centric world models matter for LLM agents.
- EN
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
A technical review of OGER, a robust offline-guided exploration reward for hybrid reinforcement learning.
- EN
FEPLB: Zero-Cost MoE Load Balancing via NVLink Copy Engine
How FEPLB targets the 18.6% GPU waste caused by MoE token imbalance and delivers a 51-70% improvement using hardware that was previously idle.
- EN
Generalization at the Edge of Stability: A Random Dynamical Systems Perspective
1. What This Paper Does
Core Problem: The edge of stability phenomenon, discovered by Cohen et al. (2021), presents a theoretical puzzle: when training with sufficiently large learning rates η, the largest Hessian eigenvalue λ₁ frequently exceeds the stability threshold 2/η, implying the system should diverge according to classical optimization theory. Yet empirically: (1) training loss continues to decrease; (2) model generalization often improves in this regime; (3) the optimizer doesn't settle at a point but explores a bounded, chaotic set. Prior explanations relying on pointwise properties (Hessian trace, spectral norm) fail to capture this phenomenon because they ignore the ensemble behavior of the attractor set.
Main Contribution: The paper's central insight is to characterize generalization through the geometric properties of the random attractor itself, not individual solutions. They prove that: (1) Sharpness Dimension (SD) < ambient dimension d with high probability at EoS; (2) worst-case generalization error depends on SD, not parameter count d; (3) the complete Hessian spectrum structure matters, not just the trace or largest eigenvalue; (4) the attractor forms a fractal set with intrinsic dimension strictly smaller than the parameter space.
This explains why overparameterized models generalize: the training dynamics naturally compress into a lower-dimensional manifold despite the high-dimensional parameter space.
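The classical 2/η threshold mentioned in this excerpt can be checked on a toy problem. The sketch below is an illustration only, not code from the paper: it runs plain gradient descent on the one-dimensional quadratic f(x) = 0.5·λ·x², where the curvature λ plays the role of the largest Hessian eigenvalue. The function names `gd_on_quadratic` and `is_classically_stable` are hypothetical.

```python
def gd_on_quadratic(lam, eta, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * lam * x**2.

    Returns |x| after `steps` updates. Each update multiplies x by
    (1 - eta * lam), so the iterate shrinks only if |1 - eta * lam| < 1.
    """
    x = x0
    for _ in range(steps):
        x -= eta * lam * x  # the gradient of f at x is lam * x
    return abs(x)


def is_classically_stable(lam, eta):
    """Classical stability condition for a quadratic: lam < 2 / eta."""
    return lam < 2.0 / eta


if __name__ == "__main__":
    eta = 0.1  # stability threshold is 2 / eta = 20
    for lam in (19.0, 21.0):
        final = gd_on_quadratic(lam, eta)
        print(f"lam={lam}: stable={is_classically_stable(lam, eta)}, |x_50|={final:.3g}")
```

On this quadratic, a curvature just below 2/η contracts the iterate and one just above makes it blow up. That is the classical prediction which, according to the excerpt, neural network training at the edge of stability appears to escape.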
- EN
SAGE: Training-Free Semantic Evidence Composition for Edge-Cloud Inference Under Hard Uplink Budgets
Paper: Choi & Park, arXiv:2604.19623 (April 2026)
Focus: Efficient inference in edge-cloud hybrid systems through optimal evidence composition
Key Contribution: Demonstrates that coverage-aware patch selection outperforms importance-only methods under hard bandwidth constraints
What This Paper Does
This paper addresses a practical but underexplored problem in edge-cloud inference systems: how should the edge device select which image patches to transmit to the server when the uplink channel strictly limits the number of patches per request? The standard approach, selecting patches by importance (attention score), turns out to be fundamentally limited. The paper shows that this creates "coverage gaps": high-attention patches cluster in the same semantic region, wasting budget on overlapping information.
SAGE proposes a simple but effective alternative that combines importance filtering with diversity-maximizing sampling, achieving 93% of the server's full-transmission accuracy while sending fewer than half the patches. The insight is elegant: under hard budgets, every transmitted patch must count, so we should prioritize information coverage alongside importance.
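To make "importance filtering plus diversity-maximizing sampling" concrete, here is a minimal sketch of the general idea. It is not the SAGE algorithm itself: the excerpt does not specify the sampling rule, so greedy farthest-point sampling is swapped in as a stand-in. The function `select_patches`, its `pool_size` parameter, and the toy feature vectors are all assumptions made for illustration.

```python
import math


def select_patches(importance, features, budget, pool_size):
    """Pick `budget` patch indices: filter by importance, then diversify.

    Step 1 keeps the `pool_size` most important patches as candidates.
    Step 2 greedily adds the candidate farthest (Euclidean distance in
    feature space) from everything already selected. This is a stand-in
    diversity rule, not the rule used in the paper.
    """
    n = len(importance)
    if budget <= 0 or n == 0:
        return []
    pool_size = min(max(pool_size, budget), n)
    candidates = sorted(range(n), key=lambda i: importance[i], reverse=True)[:pool_size]

    selected = [candidates[0]]  # seed with the most important patch
    # Distance from each remaining candidate to its nearest selected patch.
    min_dist = {i: math.dist(features[i], features[selected[0]]) for i in candidates[1:]}
    while len(selected) < budget and min_dist:
        nxt = max(min_dist, key=min_dist.get)
        selected.append(nxt)
        del min_dist[nxt]
        for i in min_dist:
            min_dist[i] = min(min_dist[i], math.dist(features[i], features[nxt]))
    return selected


if __name__ == "__main__":
    # Patches 0 and 1 are near-duplicates; patch 2 covers a different region.
    importance = [0.9, 0.85, 0.8, 0.3, 0.1]
    features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (10.0, 0.0), (0.0, 10.0)]
    print(select_patches(importance, features, budget=2, pool_size=3))
```

With a budget of two, importance-only selection would send patches 0 and 1, which carry overlapping information. The sketch instead pairs patch 0 with patch 2, a small instance of the coverage gap the excerpt describes.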