From Words to Poses: Enhancing Novel Object Pose Estimation with Vision Language Models

Pulli, Tessa; Thalhammer, Stefan; Schwaiger, Simon; Vincze, Markus

arXiv:2409.05413 (cs)

[Submitted on 9 Sep 2024]

Abstract: Robots are increasingly envisioned to interact in real-world scenarios, where they must continuously adapt to new situations. To detect and grasp novel objects, zero-shot pose estimators determine poses without prior knowledge. Recently, vision language models (VLMs) have shown considerable advances in robotics applications by establishing an understanding between language input and image input. In our work, we take advantage of VLMs' zero-shot capabilities and translate this ability to 6D object pose estimation. We propose a novel framework for promptable zero-shot 6D object pose estimation using language embeddings. The idea is to derive a coarse location of an object from the relevancy map of a language-embedded NeRF (LERF) reconstruction and to compute the pose estimate with a point cloud registration method. Additionally, we provide an analysis of LERF's suitability for open-set object pose estimation. We examine hyperparameters, such as activation thresholds for relevancy maps, and investigate the zero-shot capabilities at the instance and category level. Furthermore, we plan to conduct robotic grasping experiments in a real-world setting.
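
The two-stage idea in the abstract (threshold a relevancy map to get a coarse object location, then register a model to the selected points) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `coarse_object_points` and `kabsch`, the threshold value of 0.5, and the synthetic data are all assumptions made for illustration. In particular, the registration step below uses the Kabsch algorithm with known point correspondences, which is a simplification; a practical point cloud registration method would have to estimate the correspondences as well.

```python
import numpy as np


def coarse_object_points(points, relevancy, threshold=0.5):
    """Keep the 3D points whose relevancy score exceeds the activation threshold.

    `points` is an (N, 3) array and `relevancy` an (N,) array of per-point scores.
    The threshold value is an assumed hyperparameter, not one from the paper.
    """
    mask = relevancy > threshold
    return points[mask]


def kabsch(source, target):
    """Least-squares rigid transform (R, t) such that target ~= source @ R.T + t.

    Assumes both (N, 3) arrays are in one-to-one correspondence (a simplification).
    """
    src_centroid = source.mean(axis=0)
    tgt_centroid = target.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (source - src_centroid).T @ (target - tgt_centroid)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_centroid - R @ src_centroid
    return R, t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model = rng.normal(size=(50, 3))  # hypothetical object model points

    # Hypothetical ground-truth pose: rotation about z plus a translation.
    angle = np.deg2rad(30.0)
    R_true = np.array([
        [np.cos(angle), -np.sin(angle), 0.0],
        [np.sin(angle), np.cos(angle), 0.0],
        [0.0, 0.0, 1.0],
    ])
    t_true = np.array([0.5, -0.2, 1.0])

    # Synthetic scene: the posed object followed by unrelated clutter points.
    clutter = rng.uniform(-3.0, 3.0, size=(30, 3))
    scene = np.vstack([model @ R_true.T + t_true, clutter])
    # Synthetic relevancy: high on the object, low on the clutter.
    relevancy = np.concatenate([np.full(50, 0.9), np.full(30, 0.1)])

    object_points = coarse_object_points(scene, relevancy, threshold=0.5)
    R_est, t_est = kabsch(model, object_points)
    print("rotation recovered:", np.allclose(R_est, R_true))
    print("translation recovered:", np.allclose(t_est, t_true))
```

In this toy setup the threshold cleanly separates object from clutter, so the pose is recovered exactly. With a real relevancy map the selected region would be noisy, which is presumably why the activation threshold is treated as a hyperparameter worth examining.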

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Cite as:	arXiv:2409.05413 [cs.CV]
	(or arXiv:2409.05413v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2409.05413

Submission history

From: Simon Schwaiger
[v1] Mon, 9 Sep 2024 08:15:39 UTC (3,564 KB)
