Learning Explainable Dense Reward Shapes via Bayesian Optimization

Koo, Ryan; Yang, Ian; Raheja, Vipul; Hong, Mingyi; Jun, Kwang-Sung; Kang, Dongyeop

arXiv:2504.16272 (cs)

[Submitted on 22 Apr 2025]

Abstract: Current reinforcement learning from human feedback (RLHF) pipelines for large language model (LLM) alignment typically assign a scalar reward to each sequence, using the final token as a surrogate indicator of the quality of the entire sequence. However, this leads to sparse feedback and suboptimal token-level credit assignment. In this work, we frame reward shaping as an optimization problem focused on token-level credit assignment. We propose a reward-shaping function that leverages explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model. To learn the parameters of this shaping function, we employ a bilevel optimization framework that integrates Bayesian Optimization and policy training to handle noise from the token reward estimates. Our experiments show that achieving a better balance of token-level reward attribution leads to performance improvements over baselines on downstream tasks and to finding an optimal policy faster during training. Furthermore, we show theoretically that explainability methods that are feature-additive attribution functions preserve the same optimal policy as the original reward.
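
To make the idea of additive token-level attribution concrete, the following is a minimal sketch and not the authors' implementation. It computes exact Shapley values over token positions for a toy reward function, then mixes them with the usual sparse final-token reward. The names `toy_reward`, `shapley_token_rewards`, and `shape_rewards`, and the weights `w_attr` and `w_sparse`, are all hypothetical; the abstract does not specify the form of the shaping function. The weights stand in for the kind of parameters an outer Bayesian Optimization loop might tune.

```python
from itertools import combinations
from math import factorial


def toy_reward(tokens):
    """Hypothetical stand-in for a reward model that scores a token list."""
    score = 0.0
    if "thanks" in tokens:
        score += 1.0
    if "rude" in tokens:
        score -= 2.0
    if "thanks" in tokens and "please" in tokens:
        score += 0.5  # interaction between two tokens
    return score


def shapley_token_rewards(tokens, reward_fn):
    """Exact Shapley value per token position (exponential cost, toy sizes only)."""
    n = len(tokens)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude position i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                without_i = [tokens[j] for j in sorted(subset)]
                with_i = [tokens[j] for j in sorted(subset + (i,))]
                values[i] += weight * (reward_fn(with_i) - reward_fn(without_i))
    return values


def shape_rewards(tokens, reward_fn, w_attr, w_sparse):
    """Assumed shaping form: weighted attributions plus weighted sparse reward."""
    shaped = [w_attr * a for a in shapley_token_rewards(tokens, reward_fn)]
    shaped[-1] += w_sparse * reward_fn(tokens)  # sparse reward on the final token
    return shaped


tokens = ["please", "help", "thanks"]
dense = shapley_token_rewards(tokens, toy_reward)
shaped = shape_rewards(tokens, toy_reward, w_attr=0.7, w_sparse=0.3)
print(dense, shaped)
```

In this toy setting the per-token values sum to the sequence-level reward (the reward of the empty sequence is zero here), which is the additivity property the abstract's theoretical claim relies on. When `w_attr + w_sparse = 1`, the shaped rewards in this sketch also sum to the original sequence reward; that constraint is an assumption of the example, not something stated in the abstract.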

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2504.16272 [cs.LG]
	(or arXiv:2504.16272v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2504.16272

Submission history

From: Ryan Koo
[v1] Tue, 22 Apr 2025 21:09:33 UTC (1,922 KB)
