ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Zhou, Jingqi; Wang, Sheng; Dong, Jingwei; Li, Lei; Gao, Jiahui; Jiang, Jiyue; Kong, Lingpeng; Wu, Chuan

Computer Science > Computer Vision and Pattern Recognition

arXiv:2410.14138 (cs)

[Submitted on 18 Oct 2024 (v1), last revised 27 Mar 2025 (this version, v2)]

Abstract: Large vision-language models (LVLMs) have made significant progress on visual understanding tasks. However, they often prioritize language knowledge over image information on visual reasoning tasks, incurring performance degradation. To tackle this issue, we first identify the drawbacks of existing solutions (i.e., insufficient and irrelevant visual descriptions, and limited multi-modal capacities). We then decompose the visual reasoning process into two stages: visual perception (i.e., eyesight) and textual reasoning (i.e., wisdom), and introduce a novel visual reasoning framework named ProReason. This framework features multi-run proactive perception and decoupled vision-reasoning capabilities. Briefly, given a multi-modal question, ProReason iterates proactive information collection and reasoning until the answer can be concluded with necessary and sufficient visual descriptions. Notably, the disassociation of capabilities allows seamless integration of existing large language models (LLMs) to compensate for the reasoning deficits of LVLMs. Our extensive experiments demonstrate that ProReason outperforms both existing multi-step reasoning frameworks and passive peer methods on a wide range of benchmarks for both open-source and closed-source models. In addition, with the assistance of LLMs, ProReason achieves a performance improvement of up to 15% on the MMMU benchmark. Our insights into existing solutions and the decoupled perspective for feasible integration of LLMs illuminate future research on visual reasoning techniques, especially LLM-assisted ones.
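
The iterate-until-sufficient loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `proactive_reason`, the `perceive` and `reason` callables, the `max_rounds` cap, and the toy stand-ins are all hypothetical names introduced here for illustration. The sketch assumes only what the abstract states, namely that perception (an LVLM) and reasoning (possibly a separate LLM) are decoupled, and that information collection repeats until the reasoner can conclude an answer.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces; the abstract does not specify any API.
# perceive(question, notes) -> one new visual description (the "eyesight" role)
# reason(question, notes)   -> an answer, or None if the notes are insufficient (the "wisdom" role)
Perceiver = Callable[[str, List[str]], str]
Reasoner = Callable[[str, List[str]], Optional[str]]


def proactive_reason(
    question: str,
    perceive: Perceiver,
    reason: Reasoner,
    max_rounds: int = 5,  # assumed safety cap, not taken from the paper
) -> Tuple[Optional[str], List[str]]:
    """Alternate reasoning and perception until an answer can be concluded."""
    notes: List[str] = []
    for _ in range(max_rounds):
        answer = reason(question, notes)
        if answer is not None:
            return answer, notes
        # The reasoner could not conclude, so collect one more description.
        notes.append(perceive(question, notes))
    # Final attempt with everything collected so far.
    return reason(question, notes), notes


# Toy stand-ins so the sketch runs without any model.
def toy_perceive(question: str, notes: List[str]) -> str:
    facts = ["The image shows a clock.", "The clock hands point to 3:00."]
    return facts[len(notes)] if len(notes) < len(facts) else "No further details."


def toy_reason(question: str, notes: List[str]) -> Optional[str]:
    for note in notes:
        if "3:00" in note:
            return "3:00"
    return None


answer, notes = proactive_reason("What time is shown?", toy_perceive, toy_reason)
print(answer, len(notes))
```

Because the two roles are passed in as separate callables, a text-only LLM could be supplied as `reason` while an LVLM serves as `perceive`, which mirrors the decoupling the abstract describes.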

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2410.14138 [cs.CV]
	(or arXiv:2410.14138v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2410.14138

Submission history

From: Jingqi Zhou
[v1] Fri, 18 Oct 2024 03:22:06 UTC (1,196 KB)
[v2] Thu, 27 Mar 2025 08:07:19 UTC (972 KB)
