Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs

Zhou, Yikang; Zhang, Tao; Xu, Shilin; Chen, Shihao; Zhou, Qianyu; Tong, Yunhai; Ji, Shunping; Zhang, Jiangning; Li, Xiangtai; Qi, Lu

arXiv:2501.04670 (cs)

[Submitted on 8 Jan 2025]

Abstract: Recent multimodal models have shown strong abilities in visual perception, reasoning, and vision-language understanding. However, studies of their visual matching ability are missing, even though finding the visual correspondence of objects is essential in vision research. Our research reveals that the matching capabilities of recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings, even in strong current models such as GPT-4o. In particular, we construct a Multimodal Visual Matching (MMVM) benchmark to fairly evaluate over 30 different MLLMs. The MMVM benchmark is built from 15 open-source datasets and Internet videos with manual annotation. We categorize the data samples of the MMVM benchmark into eight aspects based on the required cues and capabilities, so that current MLLMs can be evaluated and analyzed more comprehensively. In addition, we have designed an automatic annotation pipeline to generate the MMVM SFT dataset, which includes 220K visual matching samples with reasoning annotations. Finally, we present CoLVA, a novel contrastive MLLM with two technical designs: a fine-grained vision expert with object-level contrastive learning, and an instruction augmentation strategy. CoLVA achieves 51.06% overall accuracy (OA) on the MMVM benchmark, surpassing GPT-4o and the baseline by 8.41% and 23.58% OA, respectively. The results show the effectiveness of our MMVM SFT dataset and our technical designs. Code, benchmark, dataset, and models are available at this https URL.
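
The abstract reports a single overall accuracy (OA) figure over samples grouped into categories. As a rough illustration only, the sketch below shows one common way such a score could be computed: OA as the fraction of correctly answered samples across all categories, alongside a per-category breakdown. The record layout, the category names, and the assumption that OA is a sample-level (micro) average are hypothetical and are not taken from the paper or its code.

```python
from collections import defaultdict


def accuracy_report(records):
    """Score (category, predicted, expected) tuples.

    Returns overall accuracy and per-category accuracy, both in percent.
    Assumption: overall accuracy is averaged over samples, not over categories.
    """
    total = 0
    correct = 0
    per_category = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for category, predicted, expected in records:
        hit = predicted == expected
        total += 1
        correct += hit
        per_category[category][0] += hit
        per_category[category][1] += 1
    if total == 0:
        raise ValueError("no records to score")
    overall = 100.0 * correct / total
    by_category = {c: 100.0 * k / n for c, (k, n) in per_category.items()}
    return overall, by_category


# Hypothetical toy records; category names and answers are illustrative only.
records = [
    ("color", "A", "A"),
    ("color", "B", "C"),
    ("shape", "D", "D"),
    ("shape", "B", "B"),
]
overall, by_category = accuracy_report(records)
print(f"OA = {overall:.2f}%")
for name, value in sorted(by_category.items()):
    print(f"{name}: {value:.2f}%")
```

A per-category breakdown of this kind is what makes a categorized benchmark useful for analysis: two models with a similar OA can differ sharply on individual cue types.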

Comments:	project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.04670 [cs.CV]
	(or arXiv:2501.04670v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.04670

Submission history

From: Yikang Zhou
[v1] Wed, 8 Jan 2025 18:30:53 UTC (21,302 KB)
