Generative Visual Commonsense Answering and Explaining with Generative Scene Graph Constructing

Yuan, Fan; Fang, Xiaoyuan; Quan, Rong; Li, Jing; Bi, Wei; Xu, Xiaogang; Li, Piji

arXiv:2501.09041 (cs)

[Submitted on 15 Jan 2025]

Abstract: Visual Commonsense Reasoning, regarded as a challenging task on the way to advanced visual scene comprehension, has been used to diagnose the reasoning ability of AI systems. However, reliable reasoning requires a good grasp of the scene's details. Existing work fails to effectively exploit the real-world object relationship information present in the scene and instead relies too heavily on knowledge memorized during training. Based on these observations, we propose a novel scene-graph-enhanced method for generative visual commonsense reasoning, named G2, which first uses image patches and LLMs to construct a location-free scene graph and then answers and explains based on the scene graph's information. We also propose automatic scene graph filtering and selection strategies to absorb valuable scene graph information during training. Extensive experiments are conducted on the tasks and datasets of scene graph construction and of visual commonsense answering and explaining. Experimental results and ablation analysis demonstrate the effectiveness of the proposed framework.
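To make the idea of a "location-free" scene graph concrete, the following is a minimal sketch, not the authors' implementation. It assumes the graph is a list of (subject, predicate, object) triples with no bounding-box coordinates, and it stands in a simple word-overlap filter for the paper's filtering and selection strategies, whose actual criteria the abstract does not specify. The names `Triple`, `filter_triples`, and `to_prompt`, the prompt wording, and the example scene are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One scene-graph edge; note there are no box coordinates (location-free)."""
    subject: str
    predicate: str
    obj: str


def filter_triples(triples, question, keep_top=3):
    """Hypothetical filter: rank triples by word overlap with the question.

    This is an illustrative stand-in, not the strategy used in the paper.
    """
    question_words = set(question.lower().replace("?", "").split())

    def overlap(t):
        words = set(f"{t.subject} {t.predicate} {t.obj}".lower().split())
        return len(words & question_words)

    # sorted() is stable, so triples with equal overlap keep their input order.
    ranked = sorted(triples, key=overlap, reverse=True)
    return [t for t in ranked[:keep_top] if overlap(t) > 0]


def to_prompt(triples, question):
    """Serialize the kept triples and the question into a text prompt."""
    facts = "; ".join(f"{t.subject} {t.predicate} {t.obj}" for t in triples)
    return f"Scene graph: {facts}\nQuestion: {question}\nAnswer and explain:"


# Made-up example scene, for illustration only.
graph = [
    Triple("man", "holding", "umbrella"),
    Triple("woman", "sitting on", "bench"),
    Triple("dog", "next to", "bench"),
]
question = "Why is the man holding an umbrella?"
kept = filter_triples(graph, question)
prompt = to_prompt(kept, question)
print(prompt)
```

In this sketch only the triple that shares words with the question survives the filter, so the prompt passed to a language model carries the relevant relationship rather than the whole graph. A real system would presumably use a learned or model-based relevance signal instead of word overlap.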

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2501.09041 [cs.CV]
	(or arXiv:2501.09041v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.09041

Submission history

From: Fan Yuan
[v1] Wed, 15 Jan 2025 04:00:36 UTC (6,022 KB)
