VD-BERT: A Unified Vision and Dialog Transformer with BERT

Wang, Yue; Joty, Shafiq; Lyu, Michael R.; King, Irwin; Xiong, Caiming; Hoi, Steven C. H.

arXiv:2004.13278 (cs)

[Submitted on 28 Apr 2020 (v1), last revised 2 Nov 2020 (this version, v3)]

Abstract: Visual dialog is a challenging vision-language task in which a dialog agent must answer a series of questions by reasoning over the image content and the dialog history. Prior work has mostly focused on various attention mechanisms to model such intricate interactions. By contrast, in this work we propose VD-BERT, a simple yet effective unified vision-dialog Transformer framework that leverages pretrained BERT language models for Visual Dialog tasks. The model is unified in that (1) it captures all the interactions between the image and the multi-turn dialog using a single-stream Transformer encoder, and (2) it supports both answer ranking and answer generation seamlessly through the same architecture. More crucially, we adapt BERT for the effective fusion of vision and dialog contents via visually grounded training. Without the need for pretraining on external vision-language data, our model yields a new state of the art, achieving the top position in both single-model and ensemble settings (74.54 and 75.35 NDCG scores, respectively) on the visual dialog leaderboard. Our code and pretrained models are released at this https URL.

Comments:	EMNLP 2020 (14 pages)
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2004.13278 [cs.CV]
	(or arXiv:2004.13278v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2004.13278

Submission history

From: Yue Wang
[v1] Tue, 28 Apr 2020 04:08:46 UTC (3,623 KB)
[v2] Wed, 29 Apr 2020 08:41:22 UTC (3,623 KB)
[v3] Mon, 2 Nov 2020 09:07:41 UTC (5,334 KB)
