Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining

Zhang, Yundong; Niebles, Juan Carlos; Soto, Alvaro

arXiv:1808.00265 (cs)

[Submitted on 1 Aug 2018]

Abstract: A key aspect of interpretable VQA models is their ability to ground their answers in relevant regions of the image. Current approaches with this capability rely on supervised learning and human-annotated groundings to train attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specific to visual grounding is difficult and expensive. In this work, we demonstrate that we can effectively train a VQA architecture with grounding supervision that can be automatically obtained from available region descriptions and object annotations. We also show that our model trained with this mined supervision generates visual groundings that achieve a higher correlation with manually annotated groundings, while achieving state-of-the-art VQA accuracy.
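
To make the two ideas in the abstract concrete (attention targets derived from existing box annotations, and correlation against manually annotated groundings), here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the mining procedure or the correlation measure, so the box format (x, y, w, h in pixels), the 14x14 grid, the function names, and the use of Spearman-style rank correlation are all assumptions made for illustration.

```python
import numpy as np


def boxes_to_attention(boxes, image_size, grid=(14, 14)):
    """Rasterize boxes into a grid map that sums to 1.

    Assumed box format: (x, y, w, h) in pixels. The grid size is an
    illustrative choice, not a value taken from the paper.
    """
    img_w, img_h = image_size
    rows, cols = grid
    att = np.zeros(grid, dtype=np.float64)
    for x, y, w, h in boxes:
        # Convert pixel extents to the range of grid cells they touch.
        c0 = int(np.floor(x / img_w * cols))
        c1 = int(np.ceil((x + w) / img_w * cols))
        r0 = int(np.floor(y / img_h * rows))
        r1 = int(np.ceil((y + h) / img_h * rows))
        att[max(r0, 0):min(r1, rows), max(c0, 0):min(c1, cols)] += 1.0
    total = att.sum()
    # Fall back to a uniform map when no box covers any cell.
    return att / total if total > 0 else np.full(grid, 1.0 / att.size)


def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    values = np.asarray(values, dtype=np.float64).ravel()
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(values.size, dtype=np.float64)
    i = 0
    while i < values.size:
        j = i
        while j + 1 < values.size and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def rank_correlation(map_a, map_b):
    """Pearson correlation of the ranks of two maps (Spearman-style)."""
    ra = average_ranks(map_a)
    rb = average_ranks(map_b)
    ra = ra - ra.mean()
    rb = rb - rb.mean()
    denom = np.sqrt((ra ** 2).sum() * (rb ** 2).sum())
    # A constant map has no rank variation; report 0 in that case.
    return float((ra * rb).sum() / denom) if denom > 0 else 0.0
```

In use, a mined map from boxes_to_attention could serve as a training target for a model's attention, and rank_correlation could compare a predicted attention map with a human-annotated one of the same grid size.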

Comments:	8 pages, 4 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:1808.00265 [cs.CV]
	(or arXiv:1808.00265v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1808.00265

Submission history

From: Yundong Zhang
[v1] Wed, 1 Aug 2018 11:06:08 UTC (6,809 KB)
