Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering

Dong, Junnan; Zhang, Qinggang; Zhou, Huachi; Zha, Daochen; Zheng, Pai; Huang, Xiao

Computer Science > Computer Vision and Pattern Recognition

arXiv:2402.12728 (cs)

[Submitted on 20 Feb 2024 (v1), last revised 3 Mar 2024 (this version, v2)]

Abstract: Knowledge-based visual question answering (KVQA) has been extensively studied to answer visual questions with external knowledge, e.g., knowledge graphs (KGs). While several attempts have been made to leverage large language models (LLMs) as an implicit knowledge source, this remains challenging since LLMs may generate hallucinations. Moreover, multiple knowledge sources, e.g., images, KGs, and LLMs, cannot be readily aligned for complex scenarios. To tackle these challenges, we present a novel modality-aware integration with LLMs for KVQA (MAIL). It carefully leverages multimodal knowledge for both image understanding and knowledge reasoning. Specifically, (i) we propose a two-stage prompting strategy with LLMs to densely embody the image into a scene graph with detailed visual features; (ii) we construct a coupled concept graph by linking the mentioned entities with external facts; and (iii) we design a tailored pseudo-siamese graph medium fusion for sufficient multimodal fusion. We use the mentioned entities shared by the two graphs as mediums to bridge a tight inter-modal exchange, while maximally preserving insightful intra-modal learning by constraining the fusion within the mediums. Extensive experiments on two benchmark datasets show the superiority of MAIL with 24x fewer resources.

Comments:	8 pages, 3 figures, and a 1-page appendix; the processed graphs and code will be available
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Cite as:	arXiv:2402.12728 [cs.CV]
	(or arXiv:2402.12728v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2402.12728

Submission history

From: Junnan Dong [view email]
[v1] Tue, 20 Feb 2024 05:32:24 UTC (408 KB)
[v2] Sun, 3 Mar 2024 04:51:28 UTC (402 KB)
