$\beta$-DPO: Direct Preference Optimization with Dynamic $\beta$

Wu, Junkang; Xie, Yuexiang; Yang, Zhengyi; Wu, Jiancan; Gao, Jinyang; Ding, Bolin; Wang, Xiang; He, Xiangnan

arXiv:2407.08639 (cs)

[Submitted on 11 Jul 2024]

Abstract: Direct Preference Optimization (DPO) has emerged as a compelling approach for training Large Language Models (LLMs) to adhere to human preferences. However, the performance of DPO is sensitive to the fine-tuning of its trade-off parameter $\beta$, as well as to the quality of the preference data. We analyze the impact of $\beta$ and data quality on DPO, uncovering that optimal $\beta$ values vary with the informativeness of pairwise data. Addressing the limitations of static $\beta$ values, we introduce a novel framework that dynamically calibrates $\beta$ at the batch level, informed by data quality considerations. Additionally, our method incorporates $\beta$-guided data filtering to safeguard against the influence of outliers. Through empirical evaluation, we demonstrate that our dynamic $\beta$ adjustment technique significantly improves DPO's performance across a range of models and datasets, offering a more robust and adaptable training paradigm for aligning LLMs with human feedback. The code is available at \url{this https URL}.
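
To make the role of $\beta$ concrete, the following is a minimal sketch, not the authors' implementation. `dpo_loss` is the standard per-pair DPO loss, $-\log\sigma(\beta \cdot \text{margin})$, where the margin is the difference between the policy-to-reference log-ratios of the chosen and rejected responses. `batch_beta` is a purely hypothetical batch-level rule: the abstract does not state the calibration formula, so the linear scaling and the names `beta0`, `alpha`, and `threshold` are assumptions made for illustration only.

```python
import math


def dpo_loss(beta, logratio_chosen, logratio_rejected):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    Each log-ratio is log pi(y|x) - log pi_ref(y|x) for that response.
    """
    z = beta * (logratio_chosen - logratio_rejected)
    # -log(sigmoid(z)) == softplus(-z), computed in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def batch_beta(beta0, margins, threshold, alpha):
    """Hypothetical batch-level rule (an assumption, not taken from the abstract).

    Scales the base value beta0 linearly with how far the batch's mean
    margin lies from a reference threshold. A real implementation would
    also need to keep the result positive.
    """
    mean_margin = sum(margins) / len(margins)
    return beta0 * (1.0 + alpha * (mean_margin - threshold))


if __name__ == "__main__":
    margins = [0.5, 1.5, 4.0]  # illustrative per-pair margins
    beta = batch_beta(beta0=0.1, margins=margins, threshold=1.0, alpha=0.5)
    losses = [dpo_loss(beta, m, 0.0) for m in margins]
    print(beta, losses)
```

Under this assumed rule, a batch whose mean margin exceeds the threshold is trained with a larger $\beta$ and one below it with a smaller $\beta$; whether this matches the paper's actual calibration should be checked against the paper and its released code.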

Subjects:	Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2407.08639 [cs.AI]
	(or arXiv:2407.08639v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2407.08639

Submission history

From: Junkang Wu
[v1] Thu, 11 Jul 2024 16:21:18 UTC (356 KB)
