Take A Step Back: Rethinking the Two Stages in Visual Reasoning

Zhang, Mingyu; Cai, Jiting; Liu, Mingyu; Xu, Yue; Lu, Cewu; Li, Yong-Lu

arXiv:2407.19666 (cs)

[Submitted on 29 Jul 2024]

Abstract: Visual reasoning is a prominent research area that plays a crucial role in AI by facilitating concept formation and interaction with the world. However, current works are usually carried out separately on small datasets and thus lack generalization ability. Through rigorous evaluation on diverse benchmarks, we demonstrate the shortcomings of existing ad-hoc methods in achieving cross-domain reasoning and their tendency to fit data biases. In this paper, we revisit visual reasoning from a two-stage perspective: (1) symbolization and (2) logical reasoning given symbols or their representations. We find that the reasoning stage generalizes better than the symbolization stage. Thus, it is more efficient to implement symbolization via separate encoders for different data domains while using a shared reasoner. Given our findings, we establish design principles for visual reasoning frameworks that follow this scheme of separated symbolization and shared reasoning. The proposed two-stage framework achieves impressive generalization ability on various visual reasoning tasks, including puzzles, physical prediction, and visual question answering (VQA), encompassing both 2D and 3D modalities. We believe our insights will pave the way for generalizable visual reasoning.
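The design principle in the abstract (separate symbolization per data domain, one shared reasoner) can be illustrated with a minimal sketch. This is not the authors' implementation: `puzzle_encoder`, `physics_encoder`, `shared_reasoner`, `TwoStagePipeline`, and `SYMBOL_DIM` are hypothetical names, and the toy arithmetic merely stands in for whatever learned modules a real system would use. The only point shown is the wiring: each domain has its own encoder, all encoders emit a representation of the same shape, and a single reasoner consumes it.

```python
from typing import Any, Callable, Dict, List, Sequence

Vector = List[float]
SYMBOL_DIM = 3  # assumption: size of the shared symbol representation


def puzzle_encoder(pixels: Sequence[float]) -> Vector:
    """Toy stand-in for a 2D-domain encoder: min, mean, max of the input."""
    return [min(pixels), sum(pixels) / len(pixels), max(pixels)]


def physics_encoder(points: Sequence[Sequence[float]]) -> Vector:
    """Toy stand-in for a 3D-domain encoder: per-axis mean of the points."""
    n = len(points)
    return [sum(p[axis] for p in points) / n for axis in range(SYMBOL_DIM)]


def shared_reasoner(symbols: Vector) -> int:
    """Toy domain-agnostic reasoner: index of the largest symbol value."""
    if len(symbols) != SYMBOL_DIM:
        raise ValueError("every encoder must emit SYMBOL_DIM values")
    return max(range(SYMBOL_DIM), key=lambda i: symbols[i])


class TwoStagePipeline:
    """Routes raw input through a per-domain encoder, then one shared reasoner."""

    def __init__(
        self,
        encoders: Dict[str, Callable[[Any], Vector]],
        reasoner: Callable[[Vector], int],
    ) -> None:
        self.encoders = encoders
        self.reasoner = reasoner

    def predict(self, domain: str, raw: Any) -> int:
        symbols = self.encoders[domain](raw)  # stage 1: domain-specific symbolization
        return self.reasoner(symbols)         # stage 2: shared reasoning


pipeline = TwoStagePipeline(
    {"puzzle": puzzle_encoder, "physics": physics_encoder},
    shared_reasoner,
)
```

Under these assumptions, supporting a new data domain means registering one more encoder in the dictionary while the reasoner stays unchanged, for example `pipeline.predict("puzzle", [0.2, 0.9, 0.4])`.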

Comments:	ECCV 2024, Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.19666 [cs.CV]
	(or arXiv:2407.19666v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.19666

Submission history

From: Mingyu Zhang
[v1] Mon, 29 Jul 2024 02:56:19 UTC (2,304 KB)
