Zhongzhu / Charlie
Tag: #Reinforcement Learning
55 posts tagged with this label. Back to all tags or the main feed.
2026
06-30
EN
DAPO: An Open-Source LLM Reinforcement Learning System at Scale — Technical Review
06-30
ZH
DAPO: Reading Notes on a Large-Scale LLM Reinforcement Learning System
06-29
EN
ACTS: Steering How LLMs Reason, Not Just How Long
06-29
ZH
ACTS: An RL-Trained Controller That Makes LLM Reasoning Smarter, Not Just Shorter
06-23
EN
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
06-23
ZH
Critique-GRPO: Breaking Through RL Training Bottlenecks with Natural-Language Critique Feedback
06-16
EN
Back to Basics: Revisiting REINFORCE Style Optimization for RLHF (RLOO)
06-16
ZH
Back to Basics: Rethinking Policy Gradient Optimization in RLHF with RLOO
06-13
EN
ForeMoE: Micro-step-level MoE Load Balancing for RL Post-training via Routing Foresight
06-13
ZH
ForeMoE: Micro-Step-Level MoE Load Balancing in RL Post-Training via Routing Foresight
06-09
EN
VAPO: Value-Augmented Proximal Policy Optimization for Long-CoT Reasoning
06-09
ZH
VAPO: Value-Augmented Proximal Policy Optimization for Long-Chain Reasoning
06-06
EN
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
06-06
ZH
DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models with Reinforcement Learning
06-02
EN
REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization
06-02
ZH
REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization
05-31
EN
Group Sequence Policy Optimization: A Sequence-Level RL Algorithm for Training Large Language Models
05-31
ZH
Group Sequence Policy Optimization: An RL Training Method That Corrects GRPO with Sequence-Level Importance Sampling
05-26
EN
SimPO: Simple Preference Optimization with a Reference-Free Reward
05-26
ZH
SimPO: Simple Preference Optimization Without a Reference Model
05-23
EN
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
05-23
ZH
DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models with Reinforcement Learning
05-19
EN
KTO: Model Alignment as Prospect Theoretic Optimization — Technical Blog Review
05-19
ZH
KTO: Viewing Model Alignment as Prospect-Theoretic Optimization (Reading Notes)
05-12
EN
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
05-12
ZH
DAPO: A Large-Scale Open-Source LLM Reinforcement Learning System
05-11
EN
MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems
05-11
中
MASPO:面向 LLM 多智能体系统的联合提示词优化
05-09
EN
Queueing Stability for LLM Inference with KV Cache Memory Constraints
05-01
EN
Low-Rank Optimization Trajectories for LLM RLVR Acceleration: A Technical Review of NExt
04-26
EN
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
04-24
EN
Generalization at the Edge of Stability: A Random Dynamical Systems Perspective
03-24
EN
Proximal Policy Optimization Algorithms — In-Depth Technical Review
03-24
ZH
Proximal Policy Optimization Algorithms (PPO): In-Depth Reading Notes
03-23
EN
MiRA: A Subgoal-driven Framework for Improving Long-Horizon LLM Agents — Technical Review
03-10
EN
InstructGPT: The RLHF Recipe That Turned GPT-3 Into a Helpful Assistant
02-20
EN
DeepSeekMath: How 120B Tokens of Math Data and GRPO Rival GPT-4 on Competition Problems
02-17
EN
Direct Preference Optimization: Your Language Model Is Secretly a Reward Model — Technical Review
2022
02-03
EN
Reinforcement Learning-Principle-Day12
2021
11-14
EN
Reinforcement Learning-Principle-Day11
11-07
EN
Reinforcement Learning-Principle-Day10
11-04
EN
MetaLearning-Stanford-Lecture5
10-31
EN
Reinforcement Learning-Principle-Day9
10-20
EN
Reinforcement Learning-Principle-Day8
10-13
EN
Reinforcement Learning-Principle-Day7
09-29
EN
Reinforcement Learning-Principle-Day6
07-21
EN
Reinforcement Learning-Principle-Day5
04-14
EN
MetaLearning-Stanford-Lecture4
03-05
EN
Reinforcement Learning Principle Day4
2020
12-02
EN
Reinforcement Learning-Principle-Day3
11-12
EN
MetaLearning-Stanford-Lecture3
11-04
EN
MetaLearning-Stanford-Lecture2
10-30
EN
Reinforcement Learning-Principle-Day2
08-23
EN
Reinforcement Learning-Principle-Day1
2019
11-24
EN
Reinforcement Learning_WatermelonBook_Summary