
KTO: Model Alignment as Prospect Theoretic Optimization — Technical Blog Review

Review date: 2026-05-19
Review author: Zhongzhu Zhou
Paper reviewed: KTO: Model Alignment as Prospect Theoretic Optimization
Paper authors: Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela (Stanford / Contextual AI)
arXiv: 2402.01306v4, last revised 2024-11-19; published at ICML 2024

Short answer

KTO is a paper about a deceptively simple question: do we really need paired preference data to align large language models? By 2024 the conventional answer was yes. RLHF (Christiano-style) used a learned reward model trained on pairs (yw ≻ yl | x), and DPO collapsed the entire two-stage pipeline into one cross-entropy-style loss that still consumes preference pairs. Almost every successful post-DPO objective in the open-source ecosystem (IPO, ORPO, SimPO) followed the same data shape: two responses per prompt, one preferred, one dispreferred.

KTO challenges this from two angles at once.

The first angle is theoretical: the authors reread DPO and PPO-Clip through Kahneman and Tversky’s prospect theory, the same theory that economists use to model how humans value risky gambles. They argue that these loss functions implicitly encode a human value function: a non-decreasing, concave-in-gains, often convex-in-losses curve that turns a reward into a perceived utility. They formalize this with a class they call Human-Aware Losses (HALOs) and show that DPO, PPO-Clip, and a few other “winning” methods all belong to this family, while non-HALO losses like SLiC and conditional SFT (CSFT) do not. Empirically, in a clean apples-to-apples sweep across Pythia-1.4B to Llama-30B, HALOs systematically beat non-HALOs at 13B and above.

The second angle is practical: armed with the HALO framework, the authors derive a new HALO that takes the Kahneman-Tversky value function literally. They call it Kahneman-Tversky Optimization (KTO). It needs only a binary signal: is this (x, y) pair desirable or undesirable? No pairs, no rankings, no log-likelihood-of-preferences. The headline empirical result is that KTO matches or exceeds DPO from 1B to 30B, including on benchmarks like GSM8K where it improves over DPO by 13.5 points when applied to Zephyr-β-SFT on UltraFeedback. It also stays robust under heavy class imbalance — up to a 1:10 desirable:undesirable ratio — by reweighting losses through two hyperparameters λD and λU.

My main takeaway is that KTO reframes alignment as a question about inductive bias rather than data format. The success of DPO is not just because preference data is informative; it is because the DPO loss happens to be shaped like a human-aware loss. Once you isolate that property, you can recover much of DPO’s behavior using cheaper feedback. That has real consequences for how teams collect alignment data in production — thumbs-up/thumbs-down logs from real users are far more abundant than carefully curated preference pairs, and KTO turns them into a first-class signal.

1. Prerequisites

This section explains the background a reader should hold in mind before tackling KTO. I will keep math compact but cover enough to make the derivation in §3 readable.

1.1 The classic three-stage alignment pipeline (RLHF + DPO)

By 2024, aligning an instruction-following LLM had a standard recipe:

pretrain  →  supervised fine-tune (SFT)  →  preference optimization
   π0              πref                         πθ aligned

The third stage is where alignment “happens.” There are two dominant choices.

RLHF (Christiano 2017; Ouyang 2022). Train a reward model rφ(x, y) from a dataset of preference pairs D = {(x, yw, yl)} under the Bradley-Terry assumption:

p*(yw ≻ yl | x) = σ(r*(x, yw) − r*(x, yl))

then minimize LR(rφ) = −E[log σ(rφ(x, yw) − rφ(x, yl))]. With rφ in hand, run PPO to maximize expected reward while a KL-divergence penalty keeps the policy πθ close to πref:

E_{x∈D, y∼πθ}[rφ(x, y)] − β · KL(πθ(y|x) ∥ πref(y|x))

This works, but it is famously fiddly: reward hacking, distributional shift, value-network instabilities, slow rollouts, and a lot of moving pieces.
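To make the Bradley-Terry step concrete, here is a minimal sketch of the preference probability and the reward-model loss LR from above. The function names and the toy reward values are my own, chosen for illustration; a real reward model would produce the scores.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bt_preference_prob(r_w: float, r_l: float) -> float:
    """P(y_w is preferred over y_l | x) under the Bradley-Terry model."""
    return sigmoid(r_w - r_l)

def reward_model_loss(reward_pairs) -> float:
    """Mean negative log-likelihood over (r_w, r_l) reward pairs."""
    nll = [-math.log(bt_preference_prob(r_w, r_l)) for r_w, r_l in reward_pairs]
    return sum(nll) / len(nll)

# Hypothetical reward-model scores for three preference pairs.
toy_pairs = [(2.0, 0.5), (0.1, 0.3), (1.0, 1.0)]
print(bt_preference_prob(1.0, 1.0))  # equal rewards: no preference either way
print(reward_model_loss(toy_pairs))
```

Only the reward difference matters, which is why the loss is unchanged if a constant is added to every reward for the same prompt.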

DPO (Rafailov 2023). Rafailov et al. showed that under the same Bradley-Terry assumption and the same KL-constrained RL objective, the optimal policy admits a closed-form reparameterization: r*(x, y) = β · log(π*(y|x) / πref(y|x)) + β · log Z(x). Plug this back into the Bradley-Terry log-likelihood and you obtain the DPO loss, which is a single cross-entropy-style objective on preference pairs and never needs an explicit reward model or PPO:

LDPO = −E[log σ(β · log(πθ(yw|x)/πref(yw|x)) − β · log(πθ(yl|x)/πref(yl|x)))]
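As a sanity check on the formula, the following sketch evaluates the DPO loss for a single preference pair from summed sequence log-probabilities. The argument names and the toy numbers are assumptions for illustration; in practice the log-probabilities come from forward passes of πθ and πref.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) triple.

    Each argument is a summed sequence log-probability:
    logp_* under the policy, ref_logp_* under the reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)

# When the policy equals the reference, the margin is zero and the loss is log 2.
print(dpo_loss(-10.0, -10.0, -12.0, -12.0))
# Raising y_w and lowering y_l relative to the reference reduces the loss.
print(dpo_loss(-9.0, -10.0, -13.0, -12.0))
```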

DPO became the de facto baseline because it is stable, easy to implement, and reproduces RLHF behavior on most benchmarks at a fraction of the engineering cost.

Both methods consume paired preference data. In practice that is the bottleneck.

1.2 Why paired preferences are expensive

Collecting (x, yw, yl) triples requires a labeler to compare two responses for the same prompt. Three things make this costly:

The labeler must read both responses end to end.
Disagreement and intransitivity are common — for hard prompts, a single annotator’s “preferred” choice can be the minority view (papers report up to ~30% noise on subjective tasks).
Most production telemetry is not paired. Real users click thumbs-up, mark a response as helpful, retry, or abandon. They almost never give you a second response to rank.

This is the gap KTO targets. If we could align directly on binary signals (y is good vs. y is bad), every labeled response would become a training example on its own rather than half of a pair. A dataset of n preference pairs would then yield up to 2n examples, and a much larger pool of organic, unpaired feedback would become usable.

1.3 Prospect theory in 60 seconds

Prospect theory (Kahneman & Tversky 1979; Tversky & Kahneman 1992) explains why humans do not behave like expected-utility maximizers when faced with monetary gambles. Two empirical regularities matter for KTO:

Reference dependence. Humans evaluate outcomes relative to a reference point z0 (their current wealth, expectations, recent experience), not in absolute terms. A $60 payout feels great if your reference is $50 and disappointing if it is $80.
Loss aversion and diminishing sensitivity. Losses sting more than equivalent gains feel good (factor ~2.25 in classic studies), and both gains and losses suffer diminishing marginal sensitivity as their magnitude grows. The Tversky-Kahneman value function captures this with a piecewise functional form:

v(z; λ, α, z0) = (z − z0)^α          if z ≥ z0
v(z; λ, α, z0) = −λ · (z0 − z)^α     if z < z0

with median individual parameters α ≈ 0.88 and λ ≈ 2.25.
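A minimal sketch of this value function, using the median parameters quoted above as defaults. The function name and the sample amounts are mine.

```python
def tk_value(z: float, z0: float = 0.0, alpha: float = 0.88, lam: float = 2.25) -> float:
    """Tversky-Kahneman value of outcome z relative to reference point z0."""
    delta = z - z0
    if delta >= 0:
        return delta ** alpha            # concave in gains
    return -lam * ((-delta) ** alpha)    # convex and steeper in losses

# Reference dependence: the same $60 outcome, two different reference points.
print(tk_value(60, z0=50), tk_value(60, z0=80))
# Loss aversion: a loss weighs lam times as much as an equal-sized gain.
print(tk_value(10), tk_value(-10))
```

The ratio between the value of a loss and the value of an equal-sized gain is exactly −λ, and doubling a gain less than doubles its value because α < 1.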

The shape is famous: an S-curve through the origin (at z = z0), concave in the gain regime, convex and steeper in the loss regime. KTO will replace this functional form with a logistic for numerical stability, but the shape — reference-dependent, concave gains, sharper losses — is preserved.

1.4 What does this have to do with LLMs?

The bridge from Kahneman-Tversky to LLMs is the implied reward that any RLHF-style method assigns to a generation. If πθ is the aligned policy and πref is the reference, define

rθ(x, y) = β · log(πθ(y|x) / πref(y|x))

This is just the DPO log-ratio reward. It is positive when the aligned model is more likely than the reference to emit y, negative when less likely. Crucially, this rθ plays exactly the role that “money” plays in a Kahneman-Tversky experiment. If we plug it into a Kahneman-Tversky-style value function (subtracting an appropriate reference point), we get the perceived value of generation y from a human’s point of view. That is the seed of the HALO definition.
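For readers who want to see where rθ comes from operationally, here is a small sketch that builds it from per-token log-probabilities. The helper names and the toy token scores are assumptions; the only thing taken from the text is the formula rθ = β · log(πθ / πref).

```python
def sequence_logprob(token_logprobs) -> float:
    """log pi(y|x) is the sum of the per-token log-probabilities of y."""
    return sum(token_logprobs)

def implied_reward(policy_token_logprobs, ref_token_logprobs, beta: float = 0.1) -> float:
    """r_theta(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))."""
    return beta * (sequence_logprob(policy_token_logprobs)
                   - sequence_logprob(ref_token_logprobs))

# Toy per-token log-probs for one three-token response (illustrative values only).
policy_lp = [-0.5, -1.0, -0.2]
ref_lp = [-0.7, -1.1, -0.4]
print(implied_reward(policy_lp, ref_lp))   # positive: the policy favors y more than the reference does
print(implied_reward(ref_lp, policy_lp))   # negative: the roles are swapped
```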

2. What this paper does — the core idea

The paper makes three connected contributions.

(A) A framework: Human-Aware Losses (HALOs). A loss is a HALO if it can be written as

f(πθ, πref) = E_{x,y∼D}[ ax,y · v( rθ(x, y) − E_Q[rθ(x, y')] ) ] + CD

with v non-decreasing everywhere and concave in the positive regime, Q a reference distribution over y', ax,y ∈ {−1, +1} an outcome sign, and CD a data-specific constant. The crucial structural features are:

The reward is relative — subtracted from an expected reward under a reference distribution. This is the prospect-theoretic reference point.
The value function v is concave in gains — it diminishes sensitivity for very good outcomes, encoding the human intuition that going from “great” to “spectacular” feels smaller than from “ok” to “great.”

The authors prove (Theorem 3.5) that DPO and offline PPO-Clip are HALOs. They identify CSFT (Conditional SFT) and SLiC (Sequence Likelihood Calibration) as non-HALOs because their losses cannot be written in this form — most importantly, they have no reference point that depends on πref.

(B) An empirical claim: HALOs beat non-HALOs. In Figure 2 they sweep Pythia-{1.4B, 2.8B, 6.9B, 12B} and Llama-{7B, 13B, 30B} under matched data and training settings, evaluating GPT-4-judged winrate against the SFT target. HALOs and non-HALOs are indistinguishable up to 7B; from 13B upward, HALOs become significantly better (p < 0.05 after multiple-comparison correction), and only HALO-aligned models clear the 50% winrate bar. Even more striking, an “offline PPO” baseline that uses dummy +1/-1 rewards (no learned reward model, no preference structure beyond the binary label) reaches DPO-level performance below 30B parameters.

That last result is what motivates KTO. If a simple HALO with dummy +1/-1 rewards already lands close to DPO, maybe the magic of DPO is the inductive bias of the loss, not the preference structure of the data.

(C) A new HALO: KTO. Taking the prospect-theoretic logic to its conclusion, the authors design a loss that:

takes a binary desirable/undesirable label per (x, y),
uses a logistic-shaped value function with separate slopes for gains (λD) and losses (λU),
uses the KL between πθ and πref as the reference point (estimated via a clever shuffled-microbatch trick that avoids sampling from πθ),
drops the need for paired data entirely.

Figure 3 shows that SFT+KTO is competitive with SFT+DPO across all scales 1B–30B, and KTO alone is significantly better than DPO alone at 7B and 30B. Table 2 shows that on GSM8K, swapping DPO for KTO when aligning Zephyr-β-SFT on UltraFeedback improves accuracy by 13.5 points (40.0 → 53.5). Figure 5 shows that with only 10% of the desirable data (a 1:10 imbalance), KTO still matches DPO — a regime where DPO has no natural way to operate.

The combined message: the right inductive bias matters more than the data format. KTO is what you get when you take that lesson seriously.

3. Method details

3.1 The HALO definition, intuitively

A HALO has the schematic shape

f(πθ, πref) =  E_{x,y∼D}[ sign(x,y) · v( implied_reward(x,y) − reference_point(x) ) ]  +  const

Several pieces are worth unpacking.

implied_reward(x, y) = β · log(πθ(y|x) / πref(y|x)). Sometimes called the “log-likelihood ratio reward.” If we drop β, this is exactly the policy/ref log-ratio that appears in DPO.
reference_point(x) = E_{y'∼Q(·|x)} [implied_reward(x, y')]. The expected reward under some reference distribution Q. The choice of Q is what differentiates DPO (Q puts all its mass on the paired dispreferred response), PPO-Clip (Q = a per-token average), and KTO (Q chosen so that the reference point becomes the KL between πθ and πref).
v is the value function. It must be non-decreasing and concave on the gain side. It does not need to be convex on the loss side — the authors leave room here because not every individual is risk-seeking in losses.
sign(x, y) ∈ {−1, +1} flips for desirable vs. undesirable outcomes.

DPO fits this template with v = log σ, Q being a point mass on the paired dispreferred response yl, and sign = −1 on each (x, yw) term, which supplies the leading minus sign in LDPO. Conditional SFT does not fit, because there is no πref-dependent reference point.
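The template can be checked mechanically. The sketch below writes one HALO term as sign · v(reward − reference point) and confirms that DPO drops out of it. Note that with v = log σ the sign on the (x, yw) term has to be −1 in order to reproduce the leading minus in LDPO. The function names are mine, and the rewards are assumed to be already scaled by β.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z)); non-decreasing and concave everywhere."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def halo_term(sign: float, value_fn, reward: float, reference_point: float) -> float:
    """One summand of the HALO template: sign * v(reward - reference_point)."""
    return sign * value_fn(reward - reference_point)

def dpo_as_halo(r_w: float, r_l: float) -> float:
    """DPO as a HALO: Q is a point mass on y_l, so the reference point is r_l."""
    return halo_term(-1.0, log_sigmoid, r_w, r_l)

def dpo_direct(r_w: float, r_l: float) -> float:
    """DPO loss written directly as -log sigmoid(r_w - r_l)."""
    return -log_sigmoid(r_w - r_l)

print(dpo_as_halo(0.3, -0.2), dpo_direct(0.3, -0.2))
```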

3.2 The KTO loss

KTO replaces the Tversky-Kahneman (z − z0)^α value function with a logistic for numerical stability:

rθ(x, y)  =  log(πθ(y|x) / πref(y|x))

z0(x)     =  KL(πθ(y'|x) ∥ πref(y'|x))      // the reference point

v(x, y)   =  λD · σ( β · (rθ(x, y) − z0) )       if y is desirable
v(x, y)   =  λU · σ( β · (z0 − rθ(x, y)) )       if y is undesirable

LKTO(πθ, πref) = E_{x,y∼D} [ λy − v(x, y) ]

Three things to note about this construction.

The sigmoid σ provides the prospect-theoretic shape. It is concave for positive arguments (gains relative to z0) and convex for negative arguments (losses). Because σ is bounded in (0, 1), the loss is bounded too, which gives KTO better training stability than methods with unbounded log-likelihood losses on out-of-distribution data.
β controls risk aversion. Larger β makes the sigmoid steeper around z0, so the value saturates sooner and the model is more sensitive to small reward changes near the reference point, like a more risk-averse human. Smaller β flattens it, so the model can tolerate larger reward swings.
λD and λU control loss aversion. They scale the gain and loss branches independently. If your data has a 1:10 desirable:undesirable imbalance, you can compensate by setting λD ≈ 10–15 and λU = 1. The paper recommends keeping λD·nD / (λU·nU) ∈ [1, 3/2], lightly favoring desirable examples on most benchmarks because “producing good outputs is more important than avoiding bad outputs.”
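The construction above translates almost line for line into code. This is a scalar sketch under my own naming, taking the log-ratio rθ and the reference point z0 as given numbers. It is not the authors' implementation, which operates on batched tensors.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def kto_example_loss(log_ratio, z0, desirable, beta=0.1, lam_d=1.0, lam_u=1.0):
    """Per-example KTO loss: lambda_y - v(x, y).

    log_ratio is r_theta(x, y) = log(pi_theta(y|x) / pi_ref(y|x));
    z0 is the (already estimated) reference point, treated as a constant.
    """
    if desirable:
        return lam_d - lam_d * sigmoid(beta * (log_ratio - z0))
    return lam_u - lam_u * sigmoid(beta * (z0 - log_ratio))

def kto_batch_loss(batch, z0, **kwargs):
    """Mean loss over a list of (log_ratio, desirable) examples."""
    return sum(kto_example_loss(r, z0, d, **kwargs) for r, d in batch) / len(batch)

# Toy batch: (log-ratio, is_desirable). Values are illustrative only.
toy_batch = [(1.5, True), (-0.5, True), (2.0, False), (-1.0, False)]
print(kto_batch_loss(toy_batch, z0=0.2))
```

Each example contributes a loss between 0 and its λ, and an example sitting exactly at the reference point contributes λ/2. Desirable examples lower their loss by raising the log-ratio above z0, undesirable ones by pushing it below.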

3.3 The KL reference point — a clever microbatch trick

The hardest part of KTO in practice is estimating z0 = KL(πθ(y'|x) ∥ πref(y'|x)). The honest way to do this is to sample y' from πθ, but sampling from a 7B-30B model in a training loop is expensive, and it is precisely the cost that offline methods such as DPO are trying to avoid.

The trick: instead of sampling fresh y', shuffle existing batch outputs so that example i is paired with output j ≠ i. For a microbatch of size m, with j = (i + 1) mod m,

ẑ0  =  max( 0,  (1/m) · Σ_{i<m}  log( πθ(yj | xi) / πref(yj | xi) ) )

This is biased, since yj is a real dataset output that has simply been assigned to a different prompt, but it has very low variance because we never sample new tokens. The max(0, ·) clamp is there because a KL divergence cannot be negative while this mismatched average can be, and clamping can only push the estimate upward; the authors argue the resulting bias matches a real human’s “availability heuristic” anyway. Crucially, gradients do not flow through ẑ0: it only modulates the saturation of the sigmoid. This is what makes KTO trainable at scale without an expensive sampling step.
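A sketch of the shifted-pairing estimate follows. I assume the log-probabilities are available as m × m nested lists where entry [i][j] holds log π(yj | xi); only the entries with j = (i + 1) mod m are read. That layout and the function name are my own conventions, chosen to keep the example small.

```python
def estimate_z0(policy_logp, ref_logp) -> float:
    """Shuffled-microbatch estimate of the KL reference point.

    policy_logp[i][j] = log pi_theta(y_j | x_i), ref_logp[i][j] = log pi_ref(y_j | x_i).
    Prompt i is paired with the output of example (i + 1) mod m.
    The result is a plain number: no gradient is meant to flow through it.
    """
    m = len(policy_logp)
    total = 0.0
    for i in range(m):
        j = (i + 1) % m
        total += policy_logp[i][j] - ref_logp[i][j]
    return max(0.0, total / m)   # a KL estimate is clamped to be non-negative

# Toy 3 x 3 tables; only the mismatched entries (0,1), (1,2), (2,0) matter.
policy = [[0.0, -5.0, 0.0], [0.0, 0.0, -6.0], [-7.0, 0.0, 0.0]]
ref = [[0.0, -6.0, 0.0], [0.0, 0.0, -6.5], [-7.3, 0.0, 0.0]]
print(estimate_z0(policy, ref))
print(estimate_z0(ref, policy))   # negative average, so the clamp returns 0.0
```

In an autograd framework the same quantity would be computed without tracking gradients, so that it acts purely as a constant inside the sigmoid.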

There is one degenerate case: when KTO is preceded by SFT on the same data that defines the desirable set, and πref is the SFT model, then ẑ0 → 0 and the formula simplifies. The authors note that in this regime you can drop the ẑ0 estimation entirely and just set it to zero, saving one forward pass per microbatch. When KTO is not preceded by SFT, or when the SFT data is disjoint from the KTO data, estimating ẑ0 is necessary.

3.4 What data shape does KTO accept?

Three input formats are supported.

Naturally binary feedback. Each (x, y) carries a label desirable or undesirable. This is the cleanest case.
Preference data converted to binary. Given (x, yw, yl), treat (x, yw) as desirable and (x, yl) as undesirable. This doubles the effective example count from n pairs to 2n labeled examples.
One-y-per-x. Strip the pairing structure entirely: keep only one of yw or yl for each x. This simulates a setting where you cannot pair feedback at all.
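The second and third formats are simple transformations of a preference dataset. The sketch below shows one way to do them. The tuple layout and the uniformly random choice in one_y_per_x are my assumptions, not a description of the authors' preprocessing code.

```python
import random

def pairs_to_binary(pairs):
    """Turn (x, y_w, y_l) triples into (x, y, is_desirable) examples."""
    examples = []
    for x, y_w, y_l in pairs:
        examples.append((x, y_w, True))    # preferred response becomes desirable
        examples.append((x, y_l, False))   # dispreferred response becomes undesirable
    return examples

def one_y_per_x(pairs, seed: int = 0):
    """Keep a single labeled response per prompt, discarding the pairing."""
    rng = random.Random(seed)
    kept = {}
    for x, y_w, y_l in pairs:
        if x not in kept:  # at most one example per distinct prompt
            kept[x] = (x, y_w, True) if rng.random() < 0.5 else (x, y_l, False)
    return list(kept.values())

# Hypothetical toy dataset.
toy_pairs = [
    ("prompt A", "good answer A", "bad answer A"),
    ("prompt B", "good answer B", "bad answer B"),
    ("prompt A", "another good A", "another bad A"),
]
print(len(pairs_to_binary(toy_pairs)))   # two examples per pair
print(one_y_per_x(toy_pairs))
```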

Table 3 of the paper shows that on OpenAssistant + Mistral-7B, the one-y-per-x configuration cuts data by 72% but the resulting KTO model still beats both the DPO baseline and the official Mistral-7B-Instruct. That is the strongest evidence I have seen that pairing structure is not the load-bearing element in DPO-style alignment.

3.5 Equivalence class of rewards and a worst-case advantage

§4.4 of the paper is a short but important theoretical analysis. Two results stand out.

Proposition 4.1 (vanishing gradients on extreme rewards). As rθ(x, y) → ±∞, the KTO gradient on that example goes to zero. Combined with the bounded σ value function, this means KTO ignores examples whose implied reward is very large in magnitude — either because they are too easy or too noisy. Real-world feedback is noisy (Hoeffler & Ariely 1999), so a built-in noise tolerance is a feature. The downside: KTO can underfit hard-to-learn modes if β is set too high. The authors recommend lowering β and training for more epochs when you suspect underfitting.
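Proposition 4.1 is easy to see numerically. Differentiating the desirable branch of the loss with respect to the log-ratio gives a magnitude of λ · β · σ(u) · (1 − σ(u)) with u = β · (rθ − z0). The sketch below evaluates that expression; the function name is mine, and the undesirable branch has the same magnitude by symmetry.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def kto_grad_magnitude(log_ratio, z0=0.0, beta=0.1, lam=1.0) -> float:
    """|d loss / d log_ratio| for one KTO example."""
    s = sigmoid(beta * (log_ratio - z0))
    return lam * beta * s * (1.0 - s)

for r in (0.0, 20.0, 200.0, -200.0):
    print(r, kto_grad_magnitude(r))
```

The gradient peaks at λ · β / 4 when the log-ratio equals z0 and decays toward zero in both directions. A larger β raises the peak but makes the decay set in sooner, which is consistent with the advice to lower β when underfitting is suspected.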

Theorem 4.3 (worst-case under intransitive preferences). Suppose a fraction p ∈ (0.5, 1) of annotators prefer ya to yb and the remaining (1 − p) prefer yb to ya. Under a sufficiently small p and a sufficiently unaligned πref, the optimal DPO policy can actually prefer the minority answer yb. The optimal loss-neutral KTO policy (λD = λU) always produces the majority answer ya. This is a surprisingly strong worst-case property and helps explain why KTO outperforms DPO on real datasets with noisy human labels (Anthropic-HH, SHP, OpenAssistant).
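The DPO half of this claim can be illustrated with the equations from §1.1. This is my own back-of-the-envelope sketch, not the paper's proof. I assume the Bradley-Terry fit matches the empirical proportion exactly, so r(ya) − r(yb) = log(p / (1 − p)), and that the optimal policy has the closed form π*(y|x) ∝ πref(y|x) · exp(r(x, y) / β), normalized over just the two outputs.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dpo_optimal_prob_majority(p: float, ref_a: float, ref_b: float, beta: float) -> float:
    """Probability the DPO-optimal policy assigns to the majority answer y_a.

    p is the share of preferences saying y_a beats y_b (0.5 < p < 1);
    ref_a and ref_b are pi_ref(y_a|x) and pi_ref(y_b|x).
    """
    log_odds = (math.log(ref_a) - math.log(ref_b)
                + (math.log(p) - math.log(1.0 - p)) / beta)
    return sigmoid(log_odds)

# Slim majority plus a reference model that strongly favors the minority answer.
print(dpo_optimal_prob_majority(p=0.55, ref_a=0.01, ref_b=0.99, beta=0.1))
# A clear majority overcomes the same unaligned reference model.
print(dpo_optimal_prob_majority(p=0.90, ref_a=0.01, ref_b=0.99, beta=0.1))
```

Under these assumptions the policy prefers the minority answer exactly when p^(1/β) · πref(ya|x) < (1 − p)^(1/β) · πref(yb|x), which matches the "small p, unaligned πref" condition in words.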

These two results together give the clearest theoretical reason to prefer KTO: it is more robust to noisy and intransitive feedback, which is what most production data looks like.

4. Experiment setup

The paper tests KTO across two model families and several datasets. I will summarize the experimental design and call out the choices that matter.

4.1 Models

Pythia family: 1.4B, 2.8B, 6.9B, 12B. Used for the controlled HALO-vs-non-HALO sweep in §3.3 of the paper.
Llama family: 7B, 13B, 30B. Used for the same sweep, plus the headline KTO-vs-DPO comparison in §4.3 of the paper.
Mistral-7B and Zephyr-β-SFT: Used for the UltraFeedback alignment experiments and the one-y-per-x ablation.
Llama-3 8B and Qwen2.5 3B Instruct: Used in Table 1 for hyperparameter recommendations on more recent base models.

The choice to span Pythia and Llama, rather than only one family, makes the HALO-vs-non-HALO comparison much more credible: the effect is not an artifact of one architecture.

4.2 Data

Anthropic-HH (Ganguli 2022), OpenAssistant / OASST (Köpf 2023), SHP (Ethayarajh 2022) for the controlled sweep. All three are preference datasets, but KTO treats each yw as desirable and each yl as undesirable, so they double in effective example count.
UltraFeedback (Cui 2023) for the KTO-vs-DPO follow-up. This is a much larger and harder preference dataset and is the standard for 2024-era alignment experiments.
For the SFT phase, {yw} is used as supervised target.

4.3 Baselines

CSFT (Conditional SFT) — control-token prefix; the simplest non-HALO baseline.
SLiC — max-margin loss plus a language-modeling regularizer; non-HALO.
DPO — the strongest HALO baseline; standard reference for 2024.
PPO (offline) — the authors’ simplified offline PPO with dummy +1/-1 rewards; HALO.
ORPO — recent reference-free method; included for the no-πref experiments.

4.4 Evaluation

Two evaluation paradigms are used.

GPT-4-judged winrate against the SFT target. The standard 2024 LLM-as-a-judge metric. The authors validate in Appendix D that GPT-4 judgments agree with human judgments.
Closed-ended benchmarks: MMLU (0-shot), GSM8K (8-shot CoT), HumanEval (0-shot), BBH (3-shot CoT). These are exact-match or pass@1, no judge involved.

The mix is important: open-ended winrate captures the kind of “did this feel like a better answer” judgment that alignment is supposed to optimize for, while closed-ended benchmarks pin down whether the alignment changes also improve genuine task accuracy.

4.5 Hardware and hyperparameters

The headline experiments use AdamW with an effective batch size of 32 and the following ranges (Table 1):

Model	Method	LR	β	λD/λU
Llama-3 8B	SFT+KTO	5e-6	0.05	1/1
Llama-3 8B	KTO (no SFT)	5e-6	0.10	1/1
Qwen2.5 3B Instruct	SFT+KTO	5e-6	0.10	1/1
Qwen2.5 3B Instruct	KTO (no SFT)	5e-6	0.50	1/1

Two observations.

KTO uses a learning rate roughly 2× to 10× larger than DPO (DPO default is 5e-7; KTO default is 5e-6). This makes sense because rθ − z0 is on average smaller in magnitude for KTO than β·log(πθ(yw|x)/πref(yw|x)) − β·log(πθ(yl|x)/πref(yl|x)) for DPO. To make non-trivial progress in the same number of steps, KTO needs a larger learning rate.
β is lower (a flatter, less risk-averse value function) for models that already underwent SFT, and higher when KTO is applied directly to a base model, that is, when πref is the base model and there is no SFT stage. The paper explicitly recommends β ∈ [0.01, 0.10] for post-SFT KTO and β ∈ [0.10, 1.00] for direct KTO without SFT.

4.6 Class-imbalance experiments

A particularly clean experiment varies the desirable:undesirable ratio from 1:1 down to 1:10 by randomly discarding desirable examples. For each ratio, the authors adjust λD/λU to satisfy λD·nD / (λU·nU) ∈ [1, 3/2]. The result (Figure 5): KTO with just 10% of the desirable data still matches DPO with the full data. This is the experiment that matters most for production deployments where positive feedback is rare relative to negative feedback (or vice versa).
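The weighting rule is simple arithmetic, sketched here with hypothetical helper names. Given the example counts and a fixed λU, it returns the range of λD values that keeps λD·nD / (λU·nU) inside [1, 3/2].

```python
def balance_ratio(lam_d, n_d, lam_u, n_u) -> float:
    """The quantity the paper recommends keeping in [1, 3/2]."""
    return (lam_d * n_d) / (lam_u * n_u)

def lambda_d_band(n_desirable, n_undesirable, lam_u=1.0, low=1.0, high=1.5):
    """Range of lambda_D values that satisfies the recommended band."""
    base = lam_u * n_undesirable / n_desirable
    return low * base, high * base

# A 1:10 desirable:undesirable split (counts are illustrative).
lo_d, hi_d = lambda_d_band(n_desirable=1000, n_undesirable=10000)
print(lo_d, hi_d)
print(balance_ratio(lo_d, 1000, 1.0, 10000), balance_ratio(hi_d, 1000, 1.0, 10000))
```

For a 1:10 split with λU = 1 this gives λD between 10 and 15, which is the range quoted in §3.2.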

5. Results & analysis

5.1 HALOs beat non-HALOs at scale (Figure 2)

The first big result is the controlled HALO-vs-non-HALO sweep. The y-axis plots winrate-above-50% against the SFT target, judged by GPT-4-0613. Three observations:

Below 13B parameters, the gap between HALOs (DPO, offline PPO) and non-HALOs (CSFT, SLiC) is not significant.
From 13B upward, HALOs clearly win; only HALO-aligned Llama-13B and Llama-30B clear the 50% winrate bar.
Offline PPO with dummy +1/-1 rewards matches DPO at every scale below 30B, despite having no learned reward model.

My reading: the “alignment phase transition” the paper describes is real, and it is driven by the loss family, not by the data shape. At low scale, any reasonable loss does roughly equally well. At higher scale, the loss starts to matter because the model has enough capacity to actually exploit the inductive bias.

5.2 KTO matches or exceeds DPO (Figure 3, Table 2)

The headline KTO experiments are even more striking. With matched data and matched compute:

Across the 1B-30B range, SFT+KTO is competitive with SFT+DPO, and KTO alone is significantly better than DPO alone at 7B and 30B (p < 0.01).
On Zephyr-β-SFT + UltraFeedback (Table 2 top), KTO improves over DPO on every closed-ended benchmark in their suite, with GSM8K going from 40.0 (DPO) to 53.5 (KTO) — a 13.5-point absolute jump. This is enormous for math reasoning.
KTO with one-y-per-x (Table 3) on Mistral-7B + OpenAssistant cuts training data by 72% but still outperforms DPO and the official Mistral-7B-Instruct model.

The Zephyr GSM8K result deserves more attention than it usually gets. UltraFeedback is a preference dataset; turning it into binary labels for KTO is in some sense “throwing away” the pair structure. The fact that KTO still beats DPO by 13.5 points on GSM8K suggests that the pair structure is at best neutral, and possibly counterproductive on noisy datasets where many pairs disagree.

5.3 KTO without SFT still works (Figure 4)

A surprising finding: on Llama-13B and Llama-30B, KTO without prior SFT is competitive with SFT+KTO. Other methods like DPO without SFT collapse — they tend to ramble, hallucinate entire conversations, and balloon in response length. KTO without SFT keeps average response length stable. This matters because the SFT stage is expensive, and at sufficient scale you can skip it.

5.4 Imbalance robustness (Figure 5)

The desirable:undesirable imbalance experiment is the cleanest demonstration of KTO’s data-efficiency advantage. With the desirable set reduced to 10% of its original size and λD raised to compensate, KTO still matches or exceeds DPO trained on the full dataset. This is the regime that real production data looks like — far more negative interactions than positive ones, or vice versa, depending on your product.

5.5 Ablations confirm the design (Table 2 middle/bottom)

The ablation block confirms each KTO design choice carries weight:

Remove the reference point z0 → −3.6 BBH, −4.0 GSM8K. The reference point is required for the loss to be a HALO at all.
Replace the value function with a concave-everywhere −log σ → −9.4 BBH, −11.0 GSM8K. The S-shape matters.
Replace with a risk-neutral identity v(·) = · → BBH collapses to 6.1. Prospect-theoretic curvature really matters.
No πref → the memory-efficient variant works but trails standard KTO.
β sweeps: lower β (0.01) gives higher GSM8K but slightly lower BBH; higher β (0.5) does the reverse.
λD sweeps: under-weighting desirable examples (λD = 0.5) collapses BBH; over-weighting (λD = 2.0) is also worse than the default 1.0 on UltraFeedback.

The collection of ablations is consistent with the prospect-theoretic story: removing reference-dependence, replacing the curvature, or making the loss risk-neutral all break the inductive bias, and each break hurts.

5.6 What I find most interesting

Two things.

First, the fact that offline PPO with dummy +1/-1 rewards matches DPO below 30B is the single most surprising line in the paper. It is a clean reductio: if reward learning were the secret to RLHF, this baseline could not work. The fact that it does work tells you the secret is somewhere else — in the shape of the loss, the KL constraint, and the implicit reference structure.

Second, the noise robustness theorem (4.3) is the strongest theoretical result in the paper, even though it gets less attention than the headline numbers. It quantifies why KTO should beat DPO on real data: real datasets contain intransitive preferences, and KTO has provably better worst-case behavior under intransitivity. The empirical 13.5-point jump on GSM8K via UltraFeedback is consistent with this, although UltraFeedback's labels come from GPT-4 ratings rather than human annotators, so the relevant noise there is inconsistent or contradictory ratings rather than inter-annotator disagreement.

6. Limitations & boundary conditions

6.1 What the authors explicitly acknowledge

KTO can underfit. Because Proposition 4.1 says the gradient vanishes on extreme rewards, KTO can ignore data that is hard-to-learn but necessary to recover the true reward. This may matter on tasks where complex, low-probability behaviors are precisely what you want to learn. Mitigation: lower β, train more epochs.
The reference point estimator is biased. The shuffled-microbatch trick produces an upward-biased ẑ0. The authors argue this is acceptable because humans also use heuristic reference points, but it is still a known source of inaccuracy.
The K-T value function for monetary gambles is almost certainly not the right value function for text quality. The authors are explicit that they expect more research on text-domain value functions, especially for multi-modal generation.
HALOs are not universally optimal. The paper does not claim KTO is the best HALO; it claims KTO is a competitive HALO derived from a clean theoretical prior. The right HALO for your task may differ.

6.2 What I noticed that the paper does not foreground

No native multi-turn modeling. KTO assigns a single binary label to a full (x, y) pair. Long dialogs with mixed-quality turns get a single thumbs-up or thumbs-down. This collapses fine-grained credit assignment into a sequence-level signal, which is fine for short responses but lossy for long agent rollouts. Subsequent work (process reward models, step-level KTO variants) addresses this directly.
Imbalance correction is somewhat ad-hoc. The recommendation λD·nD / (λU·nU) ∈ [1, 3/2] works empirically but rests on a single sentence of justification. For different downstream tasks (toxicity prevention, refusal robustness) the authors say λU·nU > λD·nD might be preferable, but they do not derive guidance from first principles.
The KL-as-reference-point choice is one of many. PPO-Clip uses a per-token average; DPO uses a paired-response sample; KTO uses an estimate of KL(πθ ∥ πref). None of these are necessarily the right reference point, and the authors leave this as an open question.
Sensitivity to πref quality. Like DPO, KTO compares πθ against πref. If πref is bad (heavily biased base, weak SFT, distributional shift), the alignment signal becomes noisy. The paper doesn’t characterize this dependence cleanly.
Limited evaluation diversity. Strong on math (GSM8K), BBH, MMLU; weaker coverage on code beyond HumanEval, on safety-specific benchmarks, on long-context tasks. The numbers we have are convincing for general instruction-tuned chat but less so for the broader space of post-2024 alignment workloads.

6.3 When NOT to use KTO

When your preference data is clean and low-noise (e.g., expert-annotated, paired ranking tasks where annotators agree). The theoretical noise robustness advantage of KTO disappears, and DPO’s tighter use of pair structure may win. Use a small validation sweep to confirm.
When you need fine-grained credit assignment within a long response. KTO’s sequence-level label is too coarse; consider process reward modeling or step-level objectives.
When you have no πref and no SFT, especially at small scale. The memory-efficient KTO variant exists but trails the standard version.

6.4 When KTO is the natural choice

When feedback is naturally binary — thumbs-up/down telemetry, helpfulness flags, abandonment events. This is the canonical KTO setting.
When your dataset is heavily imbalanced (e.g., 1:10 desirable:undesirable). KTO handles this via λD/λU with no architectural change.
When you suspect annotator disagreement and noise. Theorem 4.3 directly addresses this regime.

7. Reproducibility & practical notes

7.1 Code availability

The authors release code under github.com/ContextualAI/HALOs and post checkpoints on HuggingFace. The code is in PyTorch, integrates with the standard HuggingFace trl library, and is reasonably well documented. KTO is now also implemented as a first-class trainer in HuggingFace TRL (KTOTrainer), so most production users will pull from there rather than the original repository.

7.2 Compute requirements

The experiments span up to Llama-30B alignment with AdamW and an effective batch size of 32. As a rough rule of thumb (matching DPO):

1B-7B: 1-4 H100 / A100 80GB GPUs, hours to a day per epoch on UltraFeedback-scale data.
13B: 4-8 H100/A100, half a day to a day per epoch.
30B: 8+ A100/H100 with FSDP, multiple days per full alignment run.

KTO does not change these numbers materially: it has the same memory footprint as DPO (both keep πθ and πref in memory). The memory-efficient KTO variant without πref avoids loading the reference model and running its extra forward pass, but loses a few benchmark points.

7.3 Practitioner tips

Start with the recommended hyperparameters from Table 1. Learning rate 5e-6 (AdamW), β = 0.05–0.10 for post-SFT KTO, λD = λU = 1 for balanced data.
Always estimate ẑ0 when KTO is not preceded by SFT on the same data. Setting ẑ0 = 0 only works in the SFT-then-KTO regime.
Check class balance before training. If your desirable:undesirable ratio is far from 1:1, set λD = nU / nD (and λU = 1) as a first attempt, then sweep within the [1, 3/2] ratio band recommended in the paper.
Watch for response-length drift. DPO without SFT explodes; KTO without SFT does not (Figure 4). If you see length blowing up in a KTO run, your β is probably too low, since a small β lets πθ drift further from πref.
Validate on closed-ended benchmarks, not just judge winrate. GSM8K and BBH catch hidden regressions that GPT-4-as-judge will miss.

7.4 Production hooks

Two of KTO’s design choices are particularly friendly for production deployment.

Binary labels integrate naturally with telemetry. A thumbs-up button, a “helpful” rating, a “report” click — all are binary signals that KTO can ingest directly. No labeler studio required.
Imbalance robustness matches real distributions. Real telemetry is heavily biased toward “no feedback at all,” with positive feedback typically 5-10x more common than explicit negative feedback (or vice versa depending on the product). KTO handles this natively via λD/λU.

That said, KTO is still an offline alignment method. Online deployment requires periodic snapshots, πref management, and standard MLOps hygiene around drift, regression testing, and A/B rollout.

References

Ethayarajh, K. et al. KTO: Model Alignment as Prospect Theoretic Optimization. ICML 2024. arXiv:2402.01306.
Rafailov, R. et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. arXiv:2305.18290.
Christiano, P. et al. Deep Reinforcement Learning from Human Preferences. NeurIPS 2017. arXiv:1706.03741.
Ouyang, L. et al. Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022. arXiv:2203.02155.
Schulman, J. et al. Proximal Policy Optimization Algorithms. 2017. arXiv:1707.06347.
Kahneman, D. & Tversky, A. Prospect Theory: An Analysis of Decision under Risk. Econometrica 1979.
Tversky, A. & Kahneman, D. Advances in Prospect Theory: Cumulative Representation of Uncertainty. Journal of Risk and Uncertainty 1992.
Hong, J. et al. ORPO: Monolithic Preference Optimization without Reference Model. 2024. arXiv:2403.07691.
Zhao, Y. et al. SLiC-HF: Sequence Likelihood Calibration with Human Feedback. 2023. arXiv:2305.10425.
Cui, G. et al. UltraFeedback: Boosting Language Models with High-Quality Feedback. 2023. arXiv:2310.01377.

Review written on 2026-05-19 by Zhongzhu Zhou.