ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

Majumdar, Arjun; Aggarwal, Gunjan; Devnani, Bhavika; Hoffman, Judy; Batra, Dhruv

arXiv:2206.12403 (cs)

[Submitted on 24 Jun 2022 (v1), last revised 13 Oct 2023 (this version, v2)]

Abstract: We present a scalable approach for learning open-world object-goal navigation (ObjectNav) -- the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., "find a sink"). Our approach is entirely zero-shot -- i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images into a multimodal, semantic embedding space to enable training semantic-goal navigation (SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After training, SemanticNav agents can be instructed to find objects described in free-form natural language (e.g., "sink", "bathroom sink", etc.) by projecting language goals into the same multimodal, semantic embedding space. As a result, our approach enables open-world ObjectNav. We extensively evaluate our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2% to 20.0% over existing zero-shot methods. For reference, these gains are similar to or better than the 5% improvement in success between the Habitat 2020 and 2021 ObjectNav challenge winners. In an open-world setting, we discover that our agents can generalize to compound instructions in which a room is explicitly mentioned (e.g., "Find a kitchen sink") and to instructions from which the target room can be inferred (e.g., "Find a sink and a stove").
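
The goal-embedding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the encoders below are random projections standing in for a pretrained multimodal encoder, so their image and text embeddings are not actually semantically aligned, and every name and size (`encode_image`, `encode_text`, `policy`, `EMBED_DIM`, `NUM_ACTIONS`) is a hypothetical chosen for illustration. The only point shown is the interface: the policy consumes a goal vector from one shared space, so an image goal used in training can be swapped for a language goal at test time without changing the policy.

```python
import numpy as np

EMBED_DIM = 8          # hypothetical size of the shared embedding space
IMAGE_SHAPE = (4, 4)   # hypothetical tiny grayscale "image"
NUM_ACTIONS = 4        # hypothetical discrete action set

rng = np.random.default_rng(0)
# Random projections stand in for pretrained encoders (an assumption, not the paper's model).
IMAGE_PROJ = rng.normal(size=(EMBED_DIM, IMAGE_SHAPE[0] * IMAGE_SHAPE[1]))
TEXT_PROJ = rng.normal(size=(EMBED_DIM, 26))
# Untrained linear policy weights over [observation embedding, goal embedding].
POLICY_W = rng.normal(size=(NUM_ACTIONS, 2 * EMBED_DIM))


def l2_normalize(v):
    """Scale a vector to unit length so both modalities live on the same sphere."""
    return v / np.linalg.norm(v)


def encode_image(image):
    """Toy image encoder: flatten, project, normalize."""
    return l2_normalize(IMAGE_PROJ @ image.reshape(-1))


def encode_text(text):
    """Toy text encoder: letter counts, project, normalize."""
    counts = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1
    return l2_normalize(TEXT_PROJ @ counts)


def policy(observation, goal_embedding):
    """Pick an action from the current view and a goal vector of either modality."""
    features = np.concatenate([encode_image(observation), goal_embedding])
    return int(np.argmax(POLICY_W @ features))


observation = rng.random(IMAGE_SHAPE)
goal_image = rng.random(IMAGE_SHAPE)

train_goal = encode_image(goal_image)       # ImageNav-style goal used during training
test_goal = encode_text("bathroom sink")    # language goal used at test time

train_action = policy(observation, train_goal)
test_action = policy(observation, test_goal)
```

Because both goal types have the same shape and normalization, `policy` needs no modality-specific branch; in a real system the zero-shot transfer would additionally rely on the pretrained encoder placing an image of a sink and the text "sink" close together in that space.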

Comments:	code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Cite as:	arXiv:2206.12403 [cs.CV]
	(or arXiv:2206.12403v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2206.12403

Submission history

From: Arjun Majumdar [view email]
[v1] Fri, 24 Jun 2022 17:59:02 UTC (5,552 KB)
[v2] Fri, 13 Oct 2023 03:48:11 UTC (5,553 KB)
