Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

Li, You; Huang, Heyu; Chen, Chi; Huang, Kaiyu; Huang, Chao; Guo, Zonghao; Liu, Zhiyuan; Xu, Jinan; Li, Yuhua; Li, Ruixuan; Sun, Maosong

arXiv:2501.05767 (cs)

[Submitted on 10 Jan 2025 (v1), last revised 13 Jan 2025 (this version, v2)]

Abstract: The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models. Our code, model, dataset, and benchmark are fully open-sourced.

Comments: 20 pages, 8 figures
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2501.05767 [cs.CL] (or arXiv:2501.05767v2 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2501.05767

Submission history

From: You Li
[v1] Fri, 10 Jan 2025 07:56:23 UTC (17,020 KB)
[v2] Mon, 13 Jan 2025 10:38:32 UTC (17,020 KB)
