Showing posts 61–72 of 116. Each entry opens locally on this site; legacy Hexo posts link back to their original article at the bottom for reference.
2026
- EN
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — In-Depth Technical Review
GPTQ enables efficient post-training quantization of large language models to 3-4 bits with minimal accuracy loss. Covers the layer-wise quantization algorithm, Hessian-based error correction, and practical deployment.
- EN
Proximal Policy Optimization Algorithms — In-Depth Technical Review
PPO is one of the most influential RL algorithms. This review covers policy gradients, TRPO, the clipped surrogate objective, and PPO's role in RLHF/LLM alignment.
- 中
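Proximal Policy Optimization Algorithms (PPO) — In-Depth Reading Notes
PPO (Proximal Policy Optimization) is one of the most influential reinforcement learning algorithms of the deep learning era. This post walks through the full derivation from scratch, from policy gradients through TRPO to PPO's clipped objective, covers an analysis of the MuJoCo and Atari experiments, and explains PPO's central role in RLHF/LLM alignment.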
- EN
MiRA: A Subgoal-driven Framework for Improving Long-Horizon LLM Agents — Technical Review
A detailed technical review of Google DeepMind's paper 'A Subgoal-driven Framework for Improving Long-Horizon LLM Agents', analyzing how MiRA uses milestone-based subgoal decomposition and potential-based reward shaping to overcome planning bottlenecks in long-horizon web navigation.
- EN
Attention Is All You Need: The Transformer — In-Depth Technical Review
An in-depth technical review of the Transformer architecture from "Attention Is All You Need", covering self-attention mechanics, positional encoding, multi-head attention, training details, and lasting impact.
- EN
BitNet: Scaling 1-bit Transformers for Large Language Models — In-Depth Technical Review
An in-depth technical review of BitNet, covering 1-bit weight quantization, BitLinear layer design, scaling laws for binary Transformers, and practical deployment implications.
- EN
ZeRO: Shattering the Memory Wall — How DeepSpeed Trains Trillion-Parameter Models
A technical review of ZeRO (Zero Redundancy Optimizer), analyzing how partitioning optimizer states, gradients, and parameters across data-parallel processes enables training of trillion-parameter models.
- EN
MetaGPT: When LLM Agents Form a Software Company — Multi-Agent Collaboration Done Right
A technical review of MetaGPT, analyzing how encoding human software development workflows (SOP) into multi-agent systems with structured communication reduces errors in automated code generation.
- EN
FlashAttention: The IO-Aware Algorithm That Made Transformers Actually Fast
A technical review of FlashAttention, analyzing how IO-aware tiling and kernel fusion achieve exact attention computation that is both faster and more memory-efficient than standard implementations.
- EN
LoRA: Fine-Tuning Giant Models with Pocket Change — The Low-Rank Revolution
A technical review of LoRA (Low-Rank Adaptation), analyzing how injecting trainable low-rank decomposition matrices enables parameter-efficient fine-tuning of large language models with minimal overhead.
- EN
Megatron-LM: NVIDIA's Blueprint for Training Billion-Parameter Models at Scale
A technical review of Megatron-LM's efficient large-scale training system, analyzing how tensor, pipeline, and data parallelism are composed to train trillion-parameter models on GPU clusters.
- EN
PaRO: Smarter Partitioning for Distributed Training — Beyond ZeRO's One-Size-Fits-All
A technical review of PaRO, analyzing how partial redundancy optimization in data-parallel training reduces memory overhead while minimizing communication costs through selective parameter partitioning.