Aligning large language models with pointwise absolute rewards has so far required online, on-policy
algorithms such as PPO and GRPO.
In contrast, simpler methods that can leverage offline or off-policy data, such as DPO and REBEL,
are limited to learning from preference pairs or relative signals.
To bridge this gap, we introduce Quantile Reward Policy Optimization (QRPO), which learns
from pointwise absolute rewards while preserving the simplicity and offline applicability of
DPO-like methods.
QRPO uses quantile rewards to enable regression to the closed-form solution of the KL-regularized RL
objective.
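For reference, and in notation of our choosing ($\pi_{\mathrm{ref}}$ for the reference policy, $\beta$ for the KL coefficient), the closed-form solution of the KL-regularized objective $\max_{\pi} \mathbb{E}[r(x,y)] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$ is the standard Gibbs form also used to derive DPO:
\[
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\Big(\tfrac{r(x,y)}{\beta}\Big),
\qquad
Z(x) \;=\; \mathbb{E}_{y' \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\!\Big[\exp\!\Big(\tfrac{r(x,y')}{\beta}\Big)\Big].
\]
The normalizer $Z(x)$ is the partition function discussed next.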
The quantile reward yields an analytically tractable partition function, removing the need for relative
signals to cancel this term.
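A brief sketch of why this can work, under our reading of the abstract and assuming the quantile is taken with respect to the reward distribution under $\pi_{\mathrm{ref}}$ (and that this distribution is continuous): replacing $r(x,y)$ by its reference-policy quantile makes the transformed reward uniform on $[0,1]$ under $\pi_{\mathrm{ref}}$, so the partition function becomes a prompt-independent constant,
\[
\tilde r(x, y) \;=\; \Pr_{y' \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\!\big[r(x, y') \le r(x, y)\big],
\qquad
Z \;=\; \int_0^1 e^{u/\beta}\, du \;=\; \beta\big(e^{1/\beta} - 1\big).
\]
With $Z$ known in closed form, the closed-form optimum can be regressed from single (prompt, completion, reward) examples rather than recovered only through pairwise differences.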
Moreover, QRPO scales with the compute used to estimate quantile rewards, opening a new dimension
for pre-computation scaling.
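A minimal sketch of how such a quantile reward could be pre-computed per prompt, assuming it is an empirical quantile over rewards of sampled reference-policy completions; the function and variable names below are hypothetical, not taken from the paper:

import numpy as np

def quantile_reward(reward_of_y: float, reference_rewards: np.ndarray) -> float:
    """Empirical quantile of one completion's reward among the rewards of
    sampled reference-policy completions for the same prompt (in [0, 1])."""
    # Fraction of reference-sample rewards that do not exceed the target reward.
    return float(np.mean(reference_rewards <= reward_of_y))

# Hypothetical usage: drawing more reference completions per prompt gives a
# finer quantile estimate; this sampling is the pre-computation axis above.
rng = np.random.default_rng(0)
reference_rewards = rng.normal(size=256)  # stand-in for reward-model scores
print(quantile_reward(0.5, reference_rewards))

Because these estimates are fixed before optimization begins, the extra sampling cost is paid offline rather than during policy training.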
Empirically, QRPO consistently achieves top performance on chat and coding evaluations—reward model
scores, AlpacaEval 2, and LeetCode—compared to DPO, REBEL, and SimPO across diverse datasets and
8B-scale models.
Finally, we find that training with robust rewards instead of converting them to preferences induces
less length bias.