Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate

Huang, Qidong; Dong, Xiaoyi; Zhang, Pan; Zang, Yuhang; Cao, Yuhang; Wang, Jiaqi; Lin, Dahua; Zhang, Weiming; Yu, Nenghai

arXiv:2410.07167 (cs)

[Submitted on 9 Oct 2024 (v1), last revised 16 Oct 2024 (this version, v2)]

Abstract: We present the Modality Integration Rate (MIR), an effective, robust, and generalizable metric for indicating the multi-modal pre-training quality of Large Vision-Language Models (LVLMs). Large-scale pre-training plays a critical role in building capable LVLMs, yet evaluating its quality without the costly supervised fine-tuning stage remains under-explored. Loss, perplexity, and in-context evaluation results are the pre-training metrics commonly used for Large Language Models (LLMs), but we observe that these metrics are less indicative when aligning a well-trained LLM with a new modality. The lack of proper metrics greatly hinders LVLM research in the critical pre-training stage, including training data choice, efficient module design, etc. In this paper, we propose evaluating pre-training quality from the perspective of inter-modal distribution distance and present MIR, which is 1) Effective: it represents the pre-training quality and shows a positive relation with benchmark performance after supervised fine-tuning; 2) Robust: it is stable across different training/evaluation data; 3) Generalizable: it applies across training configurations and architecture choices. We conduct a series of pre-training experiments to explore the effectiveness of MIR and observe that it is indicative for training data selection, training strategy scheduling, and model architecture design aimed at better pre-training results. We hope MIR can be a helpful metric for building capable LVLMs and can inspire follow-up research on modality alignment in different areas. Our code is at: this https URL.

Comments:	Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2410.07167 [cs.CV]
	(or arXiv:2410.07167v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2410.07167

Submission history

From: Qidong Huang
[v1] Wed, 9 Oct 2024 17:59:04 UTC (1,173 KB)
[v2] Wed, 16 Oct 2024 07:23:03 UTC (1,174 KB)
