Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D

Arnaud, Sergio; McVay, Paul; Martin, Ada; Majumdar, Arjun; Jatavallabhula, Krishna Murthy; Thomas, Phillip; Partsey, Ruslan; Dugas, Daniel; Gejji, Abha; Sax, Alexander; Berges, Vincent-Pierre; Henaff, Mikael; Jain, Ayush; Cao, Ang; Prasad, Ishita; Kalakrishnan, Mrinal; Rabbat, Michael; Ballas, Nicolas; Assran, Mido; Maksymets, Oleksandr; Rajeswaran, Aravind; Meier, Franziska

Computer Science > Computer Vision and Pattern Recognition

arXiv:2504.14151 (cs)

[Submitted on 19 Apr 2025]

Abstract: We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D point cloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized point cloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model.
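
For readers unfamiliar with masked prediction in latent space, the following is a minimal toy sketch of the general idea, not the 3D-JEPA implementation. Everything in it is an assumption made for illustration: the function names `masked_latent_loss` and `ema_update`, the linear "encoders", the mean-pooled predictor, and the random features standing in for CLIP/DINO-derived per-point features. The use of a slowly updated (EMA) target encoder is common in JEPA-style methods, but the abstract does not say whether 3D-JEPA uses one.

```python
import numpy as np

def masked_latent_loss(features, mask, w_context, w_target, w_predictor):
    """Toy masked-prediction loss computed in latent space (hypothetical helper).

    features:    (N, D) per-point features for one point cloud
    mask:        (N,) bool, True where a point is hidden from the context branch
    w_context:   (D, H) weights of a linear stand-in for the context encoder
    w_target:    (D, H) weights of a linear stand-in for the target encoder
    w_predictor: (H, H) weights of a linear stand-in for the predictor
    Assumes at least one masked and one visible point.
    """
    # Target branch encodes the full, unmasked input; in training it would be
    # treated as a constant (no gradient flows through it).
    target = features @ w_target                      # (N, H)
    # Context branch only sees the visible points.
    context = features[~mask] @ w_context             # (N_visible, H)
    # Toy predictor: pool the context and map it to one predicted latent.
    predicted = context.mean(axis=0) @ w_predictor    # (H,)
    # Compare prediction with target latents at masked positions only.
    diff = target[mask] - predicted                   # (N_masked, H) via broadcasting
    return float(np.mean(diff ** 2))

def ema_update(w_target, w_context, momentum=0.99):
    """Move target weights slowly toward context weights (assumed EMA scheme)."""
    return momentum * w_target + (1.0 - momentum) * w_context

# Illustrative usage with random stand-in data.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4))                    # 8 points, 4-dim features
mask = np.array([True, True, False, False, False, True, False, False])
w_context = rng.normal(size=(4, 3))
w_target = w_context.copy()
w_predictor = np.eye(3)

loss = masked_latent_loss(features, mask, w_context, w_target, w_predictor)
w_target = ema_update(w_target, w_context)
```

The point of the sketch is that the loss compares latent vectors rather than reconstructing raw inputs. A realistic predictor would also condition on the 3D positions of the masked points, which this toy version omits.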

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
ACM classes:	I.2.10; I.2.6; I.2.9; I.3.7; I.4.6; I.4.8
Cite as:	arXiv:2504.14151 [cs.CV]
	(or arXiv:2504.14151v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.14151

Submission history

From: Sergio Arnaud
[v1] Sat, 19 Apr 2025 02:51:24 UTC (38,420 KB)
