Do Visual Imaginations Improve Vision-and-Language Navigation Agents?

Perincherry, Akhil; Krantz, Jacob; Lee, Stefan

arXiv:2503.16394 (cs)

[Submitted on 20 Mar 2025]

Abstract: Vision-and-Language Navigation (VLN) agents are tasked with navigating an unseen environment using natural language instructions. In this work, we study if visual representations of sub-goals implied by the instructions can serve as navigational cues and lead to increased navigation performance. To synthesize these visual representations or imaginations, we leverage a text-to-image diffusion model on landmark references contained in segmented instructions. These imaginations are provided to VLN agents as an added modality to act as landmark cues and an auxiliary loss is added to explicitly encourage relating these with their corresponding referring expressions. Our findings reveal an increase in success rate (SR) of around 1 point and up to 0.5 points in success scaled by inverse path length (SPL) across agents. These results suggest that the proposed approach reinforces visual understanding compared to relying on language instructions alone. Code and data for our work can be found at this https URL.
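For readers unfamiliar with the two metrics quoted in the abstract, the sketch below shows how SR and SPL are commonly computed in the embodied-navigation literature. It assumes the abstract's SPL is the standard definition (per-episode success weighted by the ratio of shortest-path length to the longer of the agent's path and the shortest path), which the abstract itself does not spell out. The `Episode` class, its field names, and the sample values are hypothetical and are not taken from the paper or its code release.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One evaluation episode (hypothetical container, not from the paper)."""
    success: bool         # agent stopped close enough to the goal
    shortest_path: float  # shortest-path length from start to goal, assumed > 0
    agent_path: float     # length of the path the agent actually took


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by path efficiency, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e.success:
            # Efficiency is 1.0 for an optimal path and shrinks as the agent detours.
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Made-up episodes purely for illustration.
episodes = [
    Episode(success=True, shortest_path=10.0, agent_path=10.0),  # optimal
    Episode(success=True, shortest_path=10.0, agent_path=20.0),  # twice as long
    Episode(success=False, shortest_path=10.0, agent_path=5.0),  # failed
    Episode(success=True, shortest_path=8.0, agent_path=8.0),    # optimal
]

print(f"SR  = {100 * success_rate(episodes):.1f}")
print(f"SPL = {100 * spl(episodes):.1f}")
```

Both values are multiplied by 100 when printed because results of this kind are usually reported in percentage points, which is presumably the sense of "around 1 point" in the abstract.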

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Cite as:	arXiv:2503.16394 [cs.CV]
	(or arXiv:2503.16394v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.16394

Submission history

From: Akhil Perincherry
[v1] Thu, 20 Mar 2025 17:53:12 UTC (11,436 KB)
