Zhongzhu / Charlie
Tag: #RLHF
22 posts tagged with this label.
2026
06-23
EN
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
06-23
ZH
Critique-GRPO: Breaking Through Reinforcement Learning Training Bottlenecks with Natural-Language Critique Feedback
06-16
EN
Back to Basics: Revisiting REINFORCE Style Optimization for RLHF (RLOO)
06-16
ZH
Back to Basics: Rethinking Policy Gradient Optimization in RLHF with RLOO
06-13
EN
ForeMoE: Micro-step-level MoE Load Balancing for RL Post-training via Routing Foresight
06-13
ZH
ForeMoE: Micro-step-level MoE Load Balancing in RL Post-training via Routing Foresight
06-02
EN
REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization
06-02
ZH
REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization
05-26
EN
SimPO: Simple Preference Optimization with a Reference-Free Reward
05-26
ZH
SimPO: Simple Preference Optimization Without a Reference Model
05-19
EN
KTO: Model Alignment as Prospect Theoretic Optimization — Technical Blog Review
05-19
ZH
KTO: Viewing Model Alignment as "Prospect Theory" Optimization (Reading Notes)
04-14
EN
Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts — Deep Technical Review
04-14
ZH
ArmoRM: Interpretable Preference Learning with "Multi-Objective Reward Modeling + Mixture-of-Experts Gating" (In-Depth Reading Notes)
04-07
EN
ORPO: Monolithic Preference Optimization without Reference Model — In-Depth Technical Review
04-07
ZH
ORPO: Monolithic Preference Optimization Without a Reference Model (In-Depth Reading Notes)
03-31
EN
Constitutional AI: Harmlessness from AI Feedback — In-Depth Technical Review
03-24
EN
Proximal Policy Optimization Algorithms — In-Depth Technical Review
03-24
ZH
Proximal Policy Optimization (PPO) Algorithms: In-Depth Reading Notes
03-12
EN
PaRO: Smarter Partitioning for Distributed Training — Beyond ZeRO's One-Size-Fits-All
03-10
EN
InstructGPT: The RLHF Recipe That Turned GPT-3 Into a Helpful Assistant
02-17
EN
Direct Preference Optimization: Your Language Model Is Secretly a Reward Model — Technical Review