Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding

Chen, William; Hu, Siyi; Talak, Rajat; Carlone, Luca

arXiv:2209.05629 (cs)

[Submitted on 12 Sep 2022 (v1), last revised 8 Nov 2023 (this version, v2)]

Abstract: Abstract semantic 3D scene understanding is a problem of critical importance in robotics. As robots still lack the common-sense knowledge about household objects and locations of an average human, we investigate the use of pre-trained language models to impart common sense for scene understanding. We introduce and compare a wide range of scene classification paradigms that leverage language only (zero-shot, embedding-based, and structured-language) or vision and language (zero-shot and fine-tuned). We find that the best approaches in both categories yield $\sim 70\%$ room classification accuracy, exceeding the performance of pure-vision and graph classifiers. We also find such methods demonstrate notable generalization and transfer capabilities stemming from their use of language.

Comments:	arXiv admin note: text overlap with arXiv:2206.04585
Subjects:	Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2209.05629 [cs.RO]
	(or arXiv:2209.05629v2 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2209.05629

Submission history

From: William Chen
[v1] Mon, 12 Sep 2022 21:36:58 UTC (6,171 KB)
[v2] Wed, 8 Nov 2023 08:37:40 UTC (7,272 KB)
