CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models

Cheng, Zihui; Chen, Qiguang; Zhang, Jin; Fei, Hao; Feng, Xiaocheng; Che, Wanxiang; Li, Min; Qin, Libo

arXiv:2412.12932 (cs)

[Submitted on 17 Dec 2024]

Abstract: Large Vision-Language Models (LVLMs) have recently demonstrated amazing success in multi-modal tasks, including advancements in Multi-modal Chain-of-Thought (MCoT) reasoning. Despite these successes, current benchmarks still follow a traditional paradigm with multi-modal input and text-modal output, which leads to significant drawbacks such as missing visual operations and vague expressions. Motivated by this, we introduce a novel Chain of Multi-modal Thought (CoMT) benchmark to address these limitations. Different from the traditional MCoT benchmark, CoMT requires both multi-modal input and multi-modal reasoning output, aiming to mimic human-like reasoning that inherently integrates visual operation. Specifically, CoMT consists of four categories: (1) Visual Creation, (2) Visual Deletion, (3) Visual Update, and (4) Visual Selection to comprehensively explore complex visual operations and concise expression in real scenarios. We evaluate various LVLMs and strategies on CoMT, revealing some key insights into the capabilities and limitations of the current approaches. We hope that CoMT can inspire more research on introducing multi-modal generation into the reasoning process.

Comments:	Accepted at AAAI 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2412.12932 [cs.CV]
	(or arXiv:2412.12932v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.12932

Submission history

From: Zihui Cheng
[v1] Tue, 17 Dec 2024 14:10:16 UTC (2,894 KB)
