MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering

Yang, Shuo; Luo, Siwen; Han, Soyeon Caren; Hovy, Eduard

Computer Science > Computation and Language

arXiv:2503.18491 (cs)

[Submitted on 24 Mar 2025]

Title:MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering

Authors:Shuo Yang, Siwen Luo, Soyeon Caren Han, Eduard Hovy

View PDF HTML (experimental)

Abstract:Visual Question Answering (VQA) requires reasoning across visual and textual modalities, yet Large Vision-Language Models (LVLMs) often lack integrated commonsense knowledge, limiting their robustness in real-world scenarios. To address this, we introduce MAGIC-VQA, a novel framework that enhances VQA by systematically integrating commonsense knowledge with LVLMs. MAGIC-VQA employs a three-stage process: (1) Explicit Knowledge Integration from external sources, (2) By-Type Post-Processing for contextual refinement, and (3) Implicit Knowledge Augmentation using a Graph Neural Network (GNN) for structured reasoning. While GNNs bring greater depth to structured inference, they enable superior relational inference beyond LVLMs. MAGIC-VQA bridges a key gap by unifying commonsensse knowledge with LVLM-driven reasoning, eliminating the need for extensive pre-training or complex prompt tuning. Our framework achieves state-of-the-art performance on benchmark datasets, significantly improving commonsense reasoning in VQA.

Comments:	8 Pages, 5 figures
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2503.18491 [cs.CL]
	(or arXiv:2503.18491v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2503.18491

Submission history

From: Shuo Yang [view email]
[v1] Mon, 24 Mar 2025 09:45:26 UTC (6,934 KB)

Computer Science > Computation and Language

Title:MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computation and Language

Title:MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators