A High-Quality Text-Rich Image Instruction Tuning Dataset via Hybrid Instruction Generation

Zhou, Shijie; Zhang, Ruiyi; Zhou, Yufan; Chen, Changyou

arXiv:2412.16364 (cs)

[Submitted on 20 Dec 2024]

Abstract: Large multimodal models still struggle with text-rich images because of inadequate training data. Self-Instruct provides an annotation-free way to generate instruction data, but the quality of that data is poor, as multimodal alignment remains a hurdle even for the largest models. In this work, we propose LLaVAR-2, which enhances multimodal alignment for text-rich images through hybrid instruction generation between human annotators and large language models. Specifically, human annotators first write detailed image captions, and these annotations are then used in tailored text prompts for GPT-4o to curate a dataset. Several mechanisms filter out low-quality data, and the resulting dataset comprises 424k high-quality instruction pairs. Empirical results show that models fine-tuned on this dataset exhibit impressive enhancements over those trained with self-instruct data.
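The hybrid pipeline described in the abstract (human caption, then a tailored text prompt, then filtering) can be illustrated with a minimal Python sketch. The abstract does not give the actual prompt wording or filtering mechanisms, so everything here is an assumption made for illustration: the names `InstructionPair`, `build_prompt`, and `keep_pair`, the prompt template, and the word-overlap filter are all hypothetical. The call to GPT-4o is omitted entirely; the sketch only covers the steps before and after it.

```python
from dataclasses import dataclass


@dataclass
class InstructionPair:
    """One generated question-answer pair (hypothetical container)."""
    question: str
    answer: str


def build_prompt(caption: str, num_pairs: int = 3) -> str:
    """Embed a human-written caption in a text prompt for an LLM.

    The template is an illustrative assumption, not the paper's prompt.
    """
    return (
        "You are given a human-written description of a text-rich image.\n"
        f"Description: {caption.strip()}\n"
        f"Write {num_pairs} question-answer pairs that can be answered "
        "from the description alone."
    )


def _words(text: str) -> list[str]:
    """Lowercase the text and strip basic punctuation from each word."""
    cleaned = (w.lower().strip(".,!?") for w in text.split())
    return [w for w in cleaned if w]


def keep_pair(pair: InstructionPair, caption: str, min_overlap: float = 0.5) -> bool:
    """Hypothetical quality filter: keep a pair only if enough of its
    answer words also appear in the human caption."""
    answer_words = _words(pair.answer)
    if not pair.question.strip() or not answer_words:
        return False
    caption_words = set(_words(caption))
    overlap = sum(w in caption_words for w in answer_words) / len(answer_words)
    return overlap >= min_overlap


caption = "A poster for a jazz concert titled Blue Nights on 5 May."
candidates = [
    InstructionPair("What is the concert title?", "Blue Nights"),
    InstructionPair("Who sponsors the concert?", "Acme Corporation"),
]
kept = [p for p in candidates if keep_pair(p, caption)]
```

In this toy run, the second candidate is dropped because its answer has no support in the caption. A real filter would likely need something stronger than word overlap; the point is only that grounding generated pairs in the human annotation gives a concrete signal to filter on.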

Comments:	COLING 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2412.16364 [cs.CV]
	(or arXiv:2412.16364v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.16364

Submission history

From: Shijie Zhou
[v1] Fri, 20 Dec 2024 21:55:15 UTC (13,500 KB)
