RL-finetuning LLMs from on- and off-policy data with a single algorithm

Tang, Yunhao; Cohen, Taco; Zhang, David W.; Valko, Michal; Munos, Rémi

arXiv:2503.19612 (cs)

[Submitted on 25 Mar 2025 (v1), last revised 28 Mar 2025 (this version, v2)]

Abstract: We introduce a novel reinforcement learning algorithm, AGRO (Any-Generation Reward Optimization), for fine-tuning large language models. AGRO leverages the concept of generation consistency, which states that the optimal policy satisfies a consistency condition across any possible generation of the model. We derive algorithms that find optimal solutions via sample-based policy gradients and provide theoretical guarantees on their convergence. Our experiments demonstrate the effectiveness of AGRO in both on-policy and off-policy settings, showing improved performance on a mathematical reasoning dataset over baseline algorithms.
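
The abstract does not spell out the consistency condition. As an illustration only, the sketch below assumes the standard KL-regularized objective E[r] - beta * KL(pi || pi_ref) over a small finite set of generations; this objective, the reference policy, and all names (`optimal_policy`, `consistency_values`, `beta`) are assumptions for the example and are not taken from the abstract. Under that assumption, the optimal policy makes the quantity r(y) - beta * log(pi(y) / pi_ref(y)) take the same value for every generation y, which is one way to read "consistency across any possible generation".

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """Closed-form maximizer of E[r] - beta * KL(pi || pi_ref) on a finite set.

    Assumed setting for illustration; not confirmed by the abstract.
    """
    weights = [q * math.exp(r / beta) for q, r in zip(ref_probs, rewards)]
    z = sum(weights)  # normalizing constant
    return [w / z for w in weights]

def consistency_values(policy, ref_probs, rewards, beta):
    """Compute r(y) - beta * log(pi(y) / pi_ref(y)) for each generation y."""
    return [r - beta * math.log(p / q)
            for p, q, r in zip(policy, ref_probs, rewards)]

# Hypothetical toy problem: three candidate generations for one prompt.
ref_probs = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, 0.5]
beta = 0.5

pi_star = optimal_policy(ref_probs, rewards, beta)

# For the optimal policy the values coincide across generations.
star_values = consistency_values(pi_star, ref_probs, rewards, beta)
star_spread = max(star_values) - min(star_values)

# For a non-optimal policy (here, the reference itself) they do not.
ref_values = consistency_values(ref_probs, ref_probs, rewards, beta)
ref_spread = max(ref_values) - min(ref_values)

print("spread under optimal policy:", star_spread)
print("spread under reference policy:", ref_spread)
```

Because the condition is stated per generation, it can be checked on samples from any distribution, which is consistent with the abstract's claim that one algorithm covers both on-policy and off-policy data. How AGRO turns this into a loss is not described in the abstract and is not sketched here.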

Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2503.19612 [cs.LG] (or arXiv:2503.19612v2 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2503.19612

Submission history

From: Yunhao Tang
[v1] Tue, 25 Mar 2025 12:52:38 UTC (674 KB)
[v2] Fri, 28 Mar 2025 18:02:54 UTC (674 KB)
