Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild

Hu, Wanpeng; Liu, Haodi; Chen, Lin; Zhou, Feng; Xiao, Changming; Yang, Qi; Zhang, Changshui

arXiv:2501.02964 (cs)

[Submitted on 6 Jan 2025 (v1), last revised 7 Jan 2025 (this version, v2)]

Abstract: Complex visual reasoning remains a key challenge today. Typically, the challenge is tackled using methodologies such as Chain of Thought (CoT) and visual instruction tuning. However, how to organically combine these two methodologies for greater success remains unexplored. Issues such as hallucinations and high training cost also still need to be addressed. In this work, we devise an innovative multi-round training and reasoning framework suitable for lightweight Multimodal Large Language Models (MLLMs). Our self-questioning approach heuristically guides MLLMs to focus on visual clues relevant to the target problem, reducing hallucinations and enhancing the model's ability to describe fine-grained image details. This ultimately enables the model to perform well in complex visual reasoning and question-answering tasks. We have named this framework Socratic Questioning (SQ). To facilitate future research, we create a multimodal mini-dataset named CapQA, which includes 1k images of fine-grained activities, for visual instruction tuning and evaluation. Our proposed SQ method leads to a 31.2% improvement in the hallucination score. Our extensive experiments on various benchmarks demonstrate SQ's remarkable capabilities in heuristic self-questioning, zero-shot visual reasoning and hallucination mitigation. Our model and code will be publicly available.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2501.02964 [cs.CV]
	(or arXiv:2501.02964v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.02964

Submission history

From: Wanpeng Hu
[v1] Mon, 6 Jan 2025 12:16:56 UTC (1,964 KB)
[v2] Tue, 7 Jan 2025 02:55:15 UTC (1,964 KB)
