Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards

Hu, Zijing; Zhang, Fengda; Chen, Long; Kuang, Kun; Li, Jiahui; Gao, Kaifeng; Xiao, Jun; Wang, Xin; Zhu, Wenwu


arXiv:2503.11240 (cs)

[Submitted on 14 Mar 2025 (v1), last revised 27 Mar 2025 (this version, v2)]



Abstract: Diffusion models have achieved remarkable success in text-to-image generation. However, their practical applications are hindered by misalignment between generated images and the corresponding text prompts. To tackle this issue, reinforcement learning (RL) has been considered for fine-tuning diffusion models. Yet RL's effectiveness is limited by the challenge of sparse rewards, where feedback is available only at the end of the generation process. This makes it difficult to identify which actions during the denoising process contribute positively to the final generated image, potentially leading to ineffective or unnecessary denoising policies. To this end, this paper presents a novel RL-based framework that addresses the sparse reward problem when training diffusion models. Our framework, named $\text{B}^2\text{-DiffuRL}$, employs two strategies: \textbf{B}ackward progressive training and \textbf{B}ranch-based sampling. First, backward progressive training focuses initially on the final timesteps of the denoising process and gradually extends the training interval to earlier timesteps, easing the learning difficulty caused by sparse rewards. Second, we perform branch-based sampling for each training interval. By comparing the samples within the same branch, we can identify how much the policies of the current training interval contribute to the final image, which helps to learn effective policies instead of unnecessary ones. $\text{B}^2\text{-DiffuRL}$ is compatible with existing optimization algorithms. Extensive experiments demonstrate the effectiveness of $\text{B}^2\text{-DiffuRL}$ in improving prompt-image alignment and maintaining diversity in generated images. The code for this work is available.
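The two strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `training_interval` and `branch_advantages`, the linear growth of the training interval, and the mean/standard-deviation normalization of rewards within a branch are all assumptions made for illustration, since the abstract specifies none of these details.

```python
from statistics import mean, pstdev


def training_interval(stage, num_stages, num_timesteps):
    """Denoising steps trained at a given stage (hypothetical linear schedule).

    Steps are indexed in denoising order: 0 is the first (noisiest) step and
    num_timesteps - 1 is the final step. Stage 0 covers only the last few
    steps; each later stage extends the interval toward earlier steps.
    """
    if not 0 <= stage < num_stages:
        raise ValueError("stage must satisfy 0 <= stage < num_stages")
    # Ceiling division, so the last stage always covers every step.
    length = -(-num_timesteps * (stage + 1) // num_stages)
    return list(range(num_timesteps - length, num_timesteps))


def branch_advantages(branch_rewards, eps=1e-8):
    """Compare final-image rewards of samples that share one branch point.

    Assumption: rewards are centered and scaled within the branch, so a
    sample scores positively only if it beats its siblings.
    """
    mu = mean(branch_rewards)
    sigma = pstdev(branch_rewards)
    return [(r - mu) / (sigma + eps) for r in branch_rewards]


if __name__ == "__main__":
    # Four stages over a 20-step denoising process.
    for stage in range(4):
        steps = training_interval(stage, num_stages=4, num_timesteps=20)
        print(f"stage {stage}: train steps {steps[0]}..{steps[-1]}")
    # Three samples branched from the same intermediate latent.
    print(branch_advantages([0.2, 0.5, 0.8]))
```

Because the samples in one branch share the same earlier denoising steps, differences in their rewards can be attributed to the steps inside the current training interval, which is the intuition the abstract gives for branch-based sampling.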

Comments:	Accepted to CVPR 2025, add references
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2503.11240 [cs.CV]
	(or arXiv:2503.11240v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.11240

Submission history

From: Zijing Hu
[v1] Fri, 14 Mar 2025 09:45:19 UTC (25,797 KB)
[v2] Thu, 27 Mar 2025 02:34:59 UTC (25,797 KB)
