Stackelberg Game Preference Optimization for Data-Efficient Alignment of Language Models

Chu, Xu; Zhang, Zhixin; Jia, Tianyu; Jin, Yujie

Computer Science > Machine Learning

arXiv:2502.18099 (cs)

[Submitted on 25 Feb 2025 (v1), last revised 27 Feb 2025 (this version, v2)]

Abstract: Aligning language models with human preferences is critical for real-world deployment, but existing methods often require large amounts of high-quality human annotations. To make alignment data-efficient, we propose Stackelberg Game Preference Optimization (SGPO), a framework that models alignment as a two-player Stackelberg game in which a policy (the leader) optimizes against a worst-case preference distribution (the follower) within an $\epsilon$-Wasserstein ball, ensuring robustness to (self-)annotation noise and distribution shifts. SGPO guarantees $O(\epsilon)$-bounded regret, unlike Direct Preference Optimization (DPO), whose regret grows linearly with the distribution mismatch. We instantiate SGPO with the Stackelberg Self-Annotated Preference Optimization (SSAPO) algorithm, which iteratively self-annotates preferences and adversarially reweights the synthetically annotated preferences. Using only 2K seed preferences from the UltraFeedback dataset (1/30 of the dataset's human labels), our method achieves a 35.82% GPT-4 win rate with Mistral-7B and 40.12% with Llama3-8B-Instruct within three rounds of SSAPO.
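
The leader-follower structure described in the abstract can be written schematically as a max-min problem. The following is only a sketch under stated assumptions, not the paper's exact objective: the symbols $\pi$ (policy), $\hat{P}$ (empirical preference distribution built from seed and self-annotated data), $J(\pi, P)$ (a generic alignment objective evaluated under preference distribution $P$), $W$ (a Wasserstein distance), and $\mathcal{U}_\epsilon$ (the uncertainty set) are all notation introduced here for illustration.

$$
\max_{\pi} \; \min_{P \in \mathcal{U}_\epsilon(\hat{P})} J(\pi, P),
\qquad
\mathcal{U}_\epsilon(\hat{P}) = \{\, P : W(P, \hat{P}) \le \epsilon \,\}.
$$

Read this way, the follower picks the least favorable distribution within distance $\epsilon$ of the observed preferences, and the leader chooses a policy that performs well even against that choice. Setting $\epsilon = 0$ collapses the inner minimization to $P = \hat{P}$, that is, to optimizing directly on the observed preferences with no robustness margin.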

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2502.18099 [cs.LG]
	(or arXiv:2502.18099v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2502.18099

Submission history

From: Zhixin Zhang
[v1] Tue, 25 Feb 2025 11:08:12 UTC (142 KB)
[v2] Thu, 27 Feb 2025 06:17:28 UTC (142 KB)
