Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

Li, Chengzu; Wu, Wenshan; Zhang, Huanyu; Xia, Yan; Mao, Shaoguang; Dong, Li; Vulić, Ivan; Wei, Furu

arXiv:2501.07542 (cs)

[Submitted on 13 Jan 2025]

Abstract: Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet, it struggles in complex spatial reasoning tasks. Nonetheless, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualization, we introduce token discrepancy loss into autoregressive MLLMs. This innovation significantly improves both visual coherence and fidelity. We validate this approach through several dynamic spatial reasoning tasks. Experimental results reveal that MVoT demonstrates competitive performance across tasks. Moreover, it exhibits robust and reliable improvements in the most challenging scenarios where CoT fails. Ultimately, MVoT establishes new possibilities for complex reasoning tasks where visual thinking can effectively complement verbal reasoning.

Comments:	11 pages, 6 figures, 4 tables (27 pages, 10 figures, 16 tables including references and appendices)
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2501.07542 [cs.CL]
	(or arXiv:2501.07542v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.07542

Submission history

From: Chengzu Li
[v1] Mon, 13 Jan 2025 18:23:57 UTC (1,501 KB)
