Establishing Reliability Metrics for Reward Models in Large Language Models

Chen, Yizhou; Liu, Yawen; Wang, Xuesi; Yu, Qingtao; Huzhang, Guangda; Zeng, Anxiang; Yu, Han; Zhou, Zhiming


arXiv:2504.14838 (cs)

[Submitted on 21 Apr 2025]


Abstract: The reward model (RM) that represents human preferences plays a crucial role in optimizing the outputs of large language models (LLMs), e.g., through reinforcement learning from human feedback (RLHF) or rejection sampling. However, a long-standing challenge for RMs is their uncertain reliability, i.e., LLM outputs with higher rewards may not align with actual human preferences. Currently, there is no convincing metric to quantify the reliability of RMs. To bridge this gap, we propose the Reliable at $\eta$ (RETA) metric, which directly measures the reliability of an RM by evaluating the average quality (scored by an oracle) of the top $\eta$ quantile of responses as ranked by the RM. On top of RETA, we present an integrated benchmarking pipeline that allows anyone to evaluate their own RM without incurring additional oracle labeling costs. Extensive experimental studies demonstrate the superior stability of the RETA metric, providing solid evaluations of the reliability of various publicly available and proprietary RMs. When dealing with an unreliable RM, we can use the RETA metric to identify the optimal quantile from which to select the responses.
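The abstract defines RETA only in words, so the following is a minimal sketch of one possible reading, not the authors' implementation. It assumes that "top $\eta$ quantile" means the fraction $\eta$ of responses with the highest RM scores, that oracle scores for those responses are already available, and that the cutoff is rounded up to at least one response. The function name `reta` and its parameters are hypothetical.

```python
import math
from typing import Sequence


def reta(rm_scores: Sequence[float],
         oracle_scores: Sequence[float],
         eta: float) -> float:
    """Mean oracle score of the top-eta fraction of responses ranked by RM score.

    Assumption (not from the paper): eta is a fraction in (0, 1], and the
    number of selected responses is ceil(eta * n), with a minimum of one.
    """
    n = len(rm_scores)
    if n == 0 or n != len(oracle_scores):
        raise ValueError("rm_scores and oracle_scores must be non-empty and equal in length")
    if not 0.0 < eta <= 1.0:
        raise ValueError("eta must lie in (0, 1]")

    k = max(1, math.ceil(eta * n))
    # Indices of responses sorted by RM score, highest first.
    ranked = sorted(range(n), key=lambda i: rm_scores[i], reverse=True)
    top = ranked[:k]
    return sum(oracle_scores[i] for i in top) / k


# Toy data: the RM ranks one oracle-rejected response second.
rm = [0.9, 0.8, 0.3, 0.1]
oracle = [1.0, 0.0, 1.0, 0.0]
top_quarter = reta(rm, oracle, 0.25)  # only the highest-ranked response
top_half = reta(rm, oracle, 0.5)      # the two highest-ranked responses
```

Under this reading, an RM whose top-ranked responses score well with the oracle yields a high value at small $\eta$, and sweeping $\eta$ over a grid shows how quickly quality degrades as more of the ranking is included.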

Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2504.14838 [cs.AI]
	(or arXiv:2504.14838v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2504.14838

Submission history

From: Guangda Huzhang
[v1] Mon, 21 Apr 2025 03:39:33 UTC (1,123 KB)
