Aligning large language models with pointwise absolute rewards has so far required online, on-policy
algorithms such as PPO and GRPO.
In contrast, simpler methods that can leverage offline or off-policy data, such as DPO and REBEL,
are limited to learning from preference pairs or relative signals.
To bridge this gap, we introduce Quantile Reward Policy Optimization (QRPO), which learns
from pointwise absolute rewards while preserving the simplicity and offline applicability of
DPO-like methods.
QRPO uses quantile rewards to enable regression to the closed-form solution of the KL-regularized RL
objective.
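For reference, and in notation of our choosing ($\pi_{\mathrm{ref}}$ for the reference policy, $\beta$ for the KL coefficient), the closed-form solution of the KL-regularized objective $\max_{\pi} \mathbb{E}[r(x,y)] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$ is the standard Gibbs form also used to derive DPO:
\[
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\Big(\tfrac{r(x,y)}{\beta}\Big),
\qquad
Z(x) \;=\; \mathbb{E}_{y' \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\!\Big[\exp\!\Big(\tfrac{r(x,y')}{\beta}\Big)\Big].
\]
The normalizer $Z(x)$ is the partition function discussed next.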
The quantile reward yields an analytically tractable partition function, removing the need for relative
signals to cancel this term.
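A brief sketch of why this can work, under our reading of the abstract and assuming the quantile is taken with respect to the reward distribution under $\pi_{\mathrm{ref}}$ (and that this distribution is continuous): replacing $r(x,y)$ by its reference-policy quantile makes the transformed reward uniform on $[0,1]$ under $\pi_{\mathrm{ref}}$, so the partition function becomes a prompt-independent constant,
\[
\tilde r(x, y) \;=\; \Pr_{y' \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\!\big[r(x, y') \le r(x, y)\big],
\qquad
Z \;=\; \int_0^1 e^{u/\beta}\, du \;=\; \beta\big(e^{1/\beta} - 1\big).
\]
With $Z$ known in closed form, the closed-form optimum can be regressed from single (prompt, completion, reward) examples rather than recovered only through pairwise differences.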
Moreover, QRPO scales with the compute used to estimate quantile rewards, opening a new dimension
for pre-computation scaling.
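A minimal sketch of how such a quantile reward could be pre-computed per prompt, assuming it is an empirical quantile over rewards of sampled reference-policy completions; the function and variable names below are hypothetical, not taken from the paper:

import numpy as np

def quantile_reward(reward_of_y: float, reference_rewards: np.ndarray) -> float:
    """Empirical quantile of one completion's reward among the rewards of
    sampled reference-policy completions for the same prompt (in [0, 1])."""
    # Fraction of reference-sample rewards that do not exceed the target reward.
    return float(np.mean(reference_rewards <= reward_of_y))

# Hypothetical usage: drawing more reference completions per prompt gives a
# finer quantile estimate; this sampling is the pre-computation axis above.
rng = np.random.default_rng(0)
reference_rewards = rng.normal(size=256)  # stand-in for reward-model scores
print(quantile_reward(0.5, reference_rewards))

Because these estimates are fixed before optimization begins, the extra sampling cost is paid offline rather than during policy training.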
Empirically, QRPO consistently achieves top performance on chat and coding evaluations—reward model
scores, AlpacaEval 2, and LeetCode—compared to DPO, REBEL, and SimPO across diverse datasets and
8B-scale models.
Finally, we find that training with robust rewards instead of converting them to preferences induces
less length bias.