Interpretable by Design Visual Question Answering

Fu, Xingyu; Zhou, Ben; Chen, Sihao; Yatskar, Mark; Roth, Dan

arXiv:2305.14882v1 (cs)

[Submitted on 24 May 2023 (this version), latest version 13 Apr 2024 (v2)]

Abstract:Model interpretability has long been a hard problem for the AI community, especially in the multimodal setting, where vision and language need to be aligned and reasoned over at the same time. In this paper, we focus on the problem of Visual Question Answering (VQA). While previous research tries to probe the network structures of black-box multimodal models, we propose to tackle the problem from a different angle: treating interpretability as an explicit additional goal.
Given an image and a question, we argue that an interpretable VQA model should be able to tell what conclusions it can draw from which part of the image, and show how each statement helps to arrive at an answer. We introduce InterVQA: Interpretable-by-design VQA, where we design an explicit intermediate dynamic reasoning structure for VQA problems and enforce symbolic reasoning that uses only this structure for final answer prediction. InterVQA produces high-quality explicit intermediate reasoning steps while maintaining end-task performance similar to the state of the art (SOTA).
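
The idea of predicting an answer only from an explicit intermediate structure can be illustrated with a small sketch. This is not the InterVQA implementation: the `Statement` class, the bounding-box format, the per-answer `votes` scores, and the summing rule in `predict` are all hypothetical names and choices made up for illustration. The sketch only shows the shape of the constraint described in the abstract, namely that the final answer is computed from the statements alone and each statement records which image region it came from.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Statement:
    """One intermediate conclusion, tied to the image region it came from."""
    text: str      # what is concluded (hypothetical example text)
    region: tuple  # hypothetical bounding box: (x, y, width, height)
    votes: dict    # hypothetical support per candidate answer, in [-1, 1]


def predict(statements, candidates):
    """Pick an answer using only the statements, never the raw image or question."""
    totals = defaultdict(float)
    for s in statements:
        for answer in candidates:
            totals[answer] += s.votes.get(answer, 0.0)
    best = max(candidates, key=lambda a: totals[a])
    # The explanation lists each statement that positively supported the answer.
    explanation = [
        (s.text, s.region, s.votes[best])
        for s in statements
        if s.votes.get(best, 0.0) > 0
    ]
    return best, explanation


# Made-up statements for a question such as "What is the weather like?"
statements = [
    Statement("A person is holding an umbrella", (40, 10, 120, 200),
              {"rainy": 0.8, "sunny": -0.3}),
    Statement("The street surface looks wet", (0, 180, 320, 60),
              {"rainy": 0.6, "sunny": -0.5}),
    Statement("Shadows are faint", (0, 0, 320, 240),
              {"rainy": 0.2, "sunny": -0.1}),
]
answer, why = predict(statements, ["rainy", "sunny"])
```

Because `predict` never sees the image or the question, every answer it returns can be traced back to the listed statements and their regions, which is the property the abstract asks of an interpretable model.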

Comments:	Multimodal, Vision and Language
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2305.14882 [cs.CL]
	(or arXiv:2305.14882v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2305.14882

Submission history

From: Xingyu Fu
[v1] Wed, 24 May 2023 08:33:15 UTC (1,166 KB)
[v2] Sat, 13 Apr 2024 17:13:55 UTC (11,089 KB)
