Beyond Human Data: Aligning Multimodal Large Language Models by Iterative Self-Evolution

Tan, Wentao; Cao, Qiong; Zhan, Yibing; Xue, Chao; Ding, Changxing

Abstract: Human preference alignment can greatly enhance Multimodal Large Language Models (MLLMs), but collecting high-quality preference data is costly. A promising solution is the self-evolution strategy, where models are iteratively trained on data they generate. However, current techniques still rely on human- or GPT-annotated data and sometimes require additional models or ground truth answers. To address these issues, we propose a novel multimodal self-evolution framework that enables the model to autonomously generate high-quality questions and answers using only unannotated images.
First, we implement an image-driven self-questioning mechanism, allowing the model to create and evaluate questions based on image content, regenerating them if they are irrelevant or unanswerable. This sets a strong foundation for answer generation. Second, we introduce an answer self-enhancement technique, starting with image captioning to improve answer quality. We also use corrupted images to generate rejected answers, forming distinct preference pairs for optimization. Finally, we incorporate an image content alignment loss function alongside Direct Preference Optimization (DPO) loss to reduce hallucinations, ensuring the model focuses on image content.
Experiments show that our framework performs competitively with methods using external information, offering a more efficient and scalable approach to MLLMs.
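
For readers unfamiliar with the DPO term mentioned in the abstract, the following is a minimal sketch of the standard per-pair DPO loss, not the authors' implementation. It assumes the summed log-probabilities of the chosen answer (generated from the original image) and the rejected answer (generated from a corrupted image) are already available under both the policy and a frozen reference model. The function name, argument names, and the default `beta` are illustrative assumptions, and the image content alignment loss is omitted because the abstract does not specify its form.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair (illustrative sketch).

    Each argument is the summed log-probability of an answer given the
    image and question. `beta` is a hypothetical default, not a value
    taken from the paper.
    """
    # Implicit rewards: how much the policy has moved away from the reference.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls below that value as the policy assigns relatively more probability to the chosen answer than to the rejected one.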

Comments:	AAAI 2025. The code is available at this https URL
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2412.15650 [cs.LG]
	(or arXiv:2412.15650v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2412.15650
