Streaming Looking Ahead with Token-level Self-reward

Zhang, Hongming; Hong, Ruixin; Yu, Dong

Abstract: Autoregressive decoding algorithms that use only past information often cannot guarantee the best performance. Recent work has found that look-ahead algorithms such as Monte Carlo Tree Search (MCTS) with external reward models (RMs) can significantly improve model outputs by letting the model think ahead and use future outputs and their associated rewards to guide the current generation. Such techniques help the reinforcement fine-tuning phase by sampling better trajectories and the inference phase by selecting better outputs. However, their high computational cost limits their applicability, especially in streaming scenarios. To address this issue, we propose equipping the policy model with token-level self-reward modeling (TRM) capability, which eliminates the need for external models and extra communication. We call the new architecture Reward Transformer. In addition, we propose a streaming-looking-ahead (SLA) algorithm to further boost search efficiency through better parallelization. Experiments show that SLA achieves an overall win rate of 79.7% against the greedy decoding baseline on three general-domain datasets with a frozen policy model while maintaining streaming efficiency. Combined with reinforcement fine-tuning techniques such as DPO, SLA achieves an overall win rate of 89.4%.

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2503.00029 [cs.LG]
	(or arXiv:2503.00029v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2503.00029
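
The look-ahead idea described in the abstract can be illustrated with a toy sketch. This is not the paper's SLA algorithm or the Reward Transformer: it is a generic, sequential shallow look-ahead, assuming a hypothetical `step` function that returns, for a given prefix, candidate next tokens together with a log-probability and a token-level reward from the same model call. The names `toy_step`, `lookahead_score`, `width`, and `depth`, and all numbers in the table, are made up for illustration.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

EOS = "<eos>"
# (token, log-probability, token-level reward); all three come from one call,
# mimicking a policy that also scores its own tokens (an assumption here).
Candidate = Tuple[str, float, float]
StepFn = Callable[[Sequence[str]], List[Candidate]]

# Hypothetical toy "model": a lookup table keyed by the prefix.
_TABLE = {
    (): [("safe", -0.2, 0.3), ("bold", -1.6, 0.2)],
    ("safe",): [("end", -0.1, 0.1)],
    ("bold",): [("move", -0.1, 0.9)],
}


def toy_step(prefix: Sequence[str]) -> List[Candidate]:
    """Stand-in for one forward pass; unknown prefixes end the sequence."""
    return _TABLE.get(tuple(prefix), [(EOS, 0.0, 0.0)])


def greedy_decode(step: StepFn, max_len: int = 8) -> List[str]:
    """Baseline: always take the most likely next token (past information only)."""
    out: List[str] = []
    while len(out) < max_len:
        token, _, _ = max(step(out), key=lambda c: c[1])
        if token == EOS:
            break
        out.append(token)
    return out


def lookahead_score(step: StepFn, prefix: Sequence[str],
                    cand: Candidate, depth: int) -> float:
    """Mean token-level reward of `cand` plus a greedy rollout of up to `depth` tokens."""
    token, _, reward = cand
    rewards = [reward]
    path = list(prefix) + [token]
    for _ in range(depth):
        if path[-1] == EOS:
            break
        nxt, _, r = max(step(path), key=lambda c: c[1])
        if nxt == EOS:
            break
        rewards.append(r)
        path.append(nxt)
    return sum(rewards) / len(rewards)


def lookahead_decode(step: StepFn, width: int = 2, depth: int = 2,
                     max_len: int = 8) -> Iterator[str]:
    """Yield tokens one at a time, choosing each by its look-ahead score."""
    out: List[str] = []
    while len(out) < max_len:
        # Keep the `width` most likely candidates, then rank them by future reward.
        cands = sorted(step(out), key=lambda c: c[1], reverse=True)[:width]
        best = max(cands, key=lambda c: lookahead_score(step, out, c, depth))
        if best[0] == EOS:
            break
        yield best[0]  # emitted immediately, so a consumer can stream it
        out.append(best[0])


if __name__ == "__main__":
    print("greedy:    ", greedy_decode(toy_step))
    print("look-ahead:", list(lookahead_decode(toy_step)))
```

In this toy table, greedy decoding follows the locally most likely token, while the look-ahead decoder prefers the less likely first token because its continuation carries a higher self-assigned reward. Each rollout here is run one after another; the abstract's point about parallelization is not modeled.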
