Probabilistic Uncertain Reward Model: A Natural Generalization of Bradley-Terry Reward Model

Sun, Wangtao; Cheng, Xiang; Yu, Xing; Xu, Haotian; Yang, Zhao; He, Shizhu; Zhao, Jun; Liu, Kang

Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical technique for training large language models. However, reward hacking, a phenomenon in which models exploit flaws in the reward model, remains a significant barrier to achieving robust and scalable intelligence through long-term training. Existing studies have proposed uncertain reward models to address reward hacking, but they often lack systematic or theoretical foundations and fail to model the uncertainty that emerges intrinsically from preference data. In this paper, we propose the Probabilistic Uncertain Reward Model (PURM), a natural generalization of the classical Bradley-Terry reward model. PURM learns reward distributions directly from preference data and quantifies per-sample uncertainty via the average overlap area between reward distributions. To mitigate reward hacking, we further introduce an uncertainty-aware penalty into Proximal Policy Optimization (PPO), which leverages the learned uncertainty to dynamically balance reward optimization and exploration. We also provide a lightweight and easy-to-use implementation of PURM. Experiments demonstrate that PURM significantly delays the onset of reward hacking while improving final reward performance, outperforming baseline methods in both stability and effectiveness.
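The abstract does not give formulas, so the following is only an illustrative sketch of the three ideas it names: a distributional generalization of Bradley-Terry, uncertainty from the overlap area between reward distributions, and an uncertainty penalty on the reward. It assumes independent Gaussian rewards, treats a larger average overlap as higher uncertainty, and subtracts a weighted uncertainty from the mean reward. All function names, the Gaussian form, and the `penalty_weight` parameter are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def trapezoid(f, lo, hi, steps=4000):
    # Plain trapezoidal rule; adequate here because the integrands are smooth
    # and the standard deviations used are of comparable scale.
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h


def preference_prob(mu_a, sigma_a, mu_b, sigma_b):
    """P(a preferred over b) = E[sigmoid(r_a - r_b)].

    Assumption: r_a and r_b are independent Gaussians, so their difference
    is Gaussian with mean mu_a - mu_b and std sqrt(sigma_a^2 + sigma_b^2).
    With zero variance this reduces to the classical Bradley-Terry model.
    """
    m = mu_a - mu_b
    s = math.hypot(sigma_a, sigma_b)
    if s < 1e-12:
        return sigmoid(m)  # deterministic rewards: plain Bradley-Terry
    return trapezoid(lambda d: sigmoid(d) * normal_pdf(d, m, s),
                     m - 8.0 * s, m + 8.0 * s)


def overlap_area(mu_a, sigma_a, mu_b, sigma_b):
    """Area under min(p_a, p_b) for two Gaussian densities (between 0 and 1)."""
    lo = min(mu_a - 8.0 * sigma_a, mu_b - 8.0 * sigma_b)
    hi = max(mu_a + 8.0 * sigma_a, mu_b + 8.0 * sigma_b)
    return trapezoid(
        lambda x: min(normal_pdf(x, mu_a, sigma_a),
                      normal_pdf(x, mu_b, sigma_b)),
        lo, hi)


def average_overlap(index, rewards):
    """Mean overlap of sample `index` with every other (mu, sigma) in `rewards`.

    Assumption: this average serves as the per-sample uncertainty score.
    """
    mu_i, sigma_i = rewards[index]
    others = [r for j, r in enumerate(rewards) if j != index]
    return sum(overlap_area(mu_i, sigma_i, mu, s) for mu, s in others) / len(others)


def penalized_reward(mu, uncertainty, penalty_weight=1.0):
    # Assumed penalty form: mean reward minus a weighted uncertainty term.
    return mu - penalty_weight * uncertainty
```

In this sketch, a batch such as `[(0.0, 1.0), (0.1, 1.0), (5.0, 0.2)]` gives the first sample a higher `average_overlap` than the third, because its distribution is nearly indistinguishable from its neighbour's, so the first sample would be penalized more heavily by `penalized_reward`.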

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2503.22480 [cs.LG]
	(or arXiv:2503.22480v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2503.22480
