Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

Zhang, Shun; Chen, Zhenfang; Chen, Sunli; Shen, Yikang; Sun, Zhiqing; Gan, Chuang

arXiv:2401.16635 (cs)

[Submitted on 30 Jan 2024 (v1), last revised 22 Oct 2024 (this version, v3)]

Abstract: Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values. However, RLHF relies on a reward model that is trained with a limited amount of human preference data, which could lead to inaccurate predictions. As a result, RLHF may produce outputs that are misaligned with human values. To mitigate this issue, we contribute a reward ensemble method that allows the reward model to make more accurate predictions. Because an ensemble of large language model-based reward models can be expensive in both computation and memory, we explore efficient ensemble methods, including linear-layer ensemble and LoRA-based ensemble. Empirically, we run Best-of-$n$ and Proximal Policy Optimization with our ensembled reward models, and verify that our ensemble methods help improve the alignment performance of RLHF outputs.
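The idea of a linear-layer ensemble combined with Best-of-$n$ selection can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each candidate response has already been encoded by a shared backbone into a feature vector, that each ensemble member is a single linear head over those features, and that the heads are aggregated by a plain mean; the abstract does not state which aggregation rule is used. The names `ensemble_reward` and `best_of_n` are hypothetical.

```python
from statistics import mean


def ensemble_reward(features, heads):
    """Score one response with every linear head and average the scores.

    features: feature vector for a response (assumed to come from a shared,
              frozen backbone that is not modeled here).
    heads:    list of weight vectors, one per ensemble member (hypothetical).
    Mean aggregation is an assumption made for this sketch.
    """
    scores = [sum(w * x for w, x in zip(head, features)) for head in heads]
    return mean(scores)


def best_of_n(candidates, heads):
    """Return the index of the candidate with the highest ensemble reward."""
    rewards = [ensemble_reward(f, heads) for f in candidates]
    return max(range(len(candidates)), key=rewards.__getitem__)


if __name__ == "__main__":
    # Two toy heads over 2-dimensional features, three candidate responses.
    heads = [[1.0, 0.0], [0.0, 1.0]]
    candidates = [[0.0, 0.0], [2.0, 4.0], [1.0, 1.0]]
    print(best_of_n(candidates, heads))
```

Sharing the backbone means only the small linear heads are duplicated across ensemble members, which is where the cost saving in this kind of design would come from.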

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2401.16635 [cs.LG]
	(or arXiv:2401.16635v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2401.16635

Submission history

From: Shun Zhang
[v1] Tue, 30 Jan 2024 00:17:37 UTC (100 KB)
[v2] Tue, 21 May 2024 22:21:16 UTC (100 KB)
[v3] Tue, 22 Oct 2024 06:19:20 UTC (100 KB)
