Local Slot Attention for Vision-and-Language Navigation

Zhuang, Yifeng; Sun, Qiang; Fu, Yanwei; Chen, Lifeng; Xue, Xiangyang

doi:10.1145/3512527.3531366

arXiv:2206.08645 (cs)

[Submitted on 17 Jun 2022 (v1), last revised 22 Jun 2022 (this version, v2)]

Title:Local Slot Attention for Vision-and-Language Navigation

Authors:Yifeng Zhuang, Qiang Sun, Yanwei Fu, Lifeng Chen, Xiangyang Xue

Abstract: Vision-and-language navigation (VLN), a frontier study aiming to pave the way for general-purpose robots, has been a hot topic in the computer vision and natural language processing communities. The VLN task requires an agent to navigate to a goal location following natural language instructions in unfamiliar environments.
Recently, transformer-based models have achieved significant improvements on the VLN task, since the attention mechanism in the transformer architecture can better integrate inter- and intra-modal information of vision and language.
However, current transformer-based models have two problems.
1) The models process each view independently without taking the integrity of the objects into account.
2) During the self-attention operation in the visual modality, spatially distant views can be interwoven with each other without explicit restriction. This kind of mixing may introduce extra noise instead of useful information.
To address these issues, we propose 1) a slot-attention based module to incorporate information from segmentation of the same object, and 2) a local attention mask mechanism to limit the visual attention span. The proposed modules can be easily plugged into any VLN architecture, and we use the Recurrent VLN-Bert as our base model. Experiments on the R2R dataset show that our model achieves state-of-the-art results.
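
The local attention mask idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the mask is built, so everything below is an assumption made for illustration. The sketch assumes the panorama is discretized into a 3 x 12 grid of views (elevation x heading), treats two views as "local" when they differ by at most `radius` steps along each axis (with headings wrapping around), and uses single-head attention with identity projections. The names `local_attention_mask`, `masked_self_attention`, and `radius` are hypothetical.

```python
import numpy as np


def local_attention_mask(n_elev=3, n_head=12, radius=1):
    """Return a boolean (N, N) mask; True where view i may attend to view j.

    Assumed layout: view index = elevation * n_head + heading.
    """
    idx = np.arange(n_elev * n_head)
    elev, head = idx // n_head, idx % n_head
    d_elev = np.abs(elev[:, None] - elev[None, :])
    d_head = np.abs(head[:, None] - head[None, :])
    # Headings are circular: the last heading is adjacent to the first.
    d_head = np.minimum(d_head, n_head - d_head)
    return (d_elev <= radius) & (d_head <= radius)


def masked_self_attention(x, mask):
    """Scaled dot-product self-attention restricted by a boolean mask.

    Identity query/key/value projections are used to keep the sketch short.
    """
    scores = x @ x.T / np.sqrt(x.shape[-1])
    # Disallowed pairs get -inf so they receive zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights


# Toy usage: 36 random view features of dimension 8.
rng = np.random.default_rng(0)
views = rng.normal(size=(36, 8))
mask = local_attention_mask()
out, weights = masked_self_attention(views, mask)
```

Because every view is always allowed to attend to itself, each row of the mask has at least one permitted entry and the softmax stays well defined. In a real transformer the same effect is usually obtained by adding a large negative bias to the disallowed attention logits in each head.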

Comments:	ICMR 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2206.08645 [cs.CV]
	(or arXiv:2206.08645v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2206.08645
Related DOI:	https://doi.org/10.1145/3512527.3531366

Submission history

From: Yifeng Zhuang [view email]
[v1] Fri, 17 Jun 2022 09:21:26 UTC (3,992 KB)
[v2] Wed, 22 Jun 2022 02:32:32 UTC (3,992 KB)
