Learning A Zero-shot Occupancy Network from Vision Foundation Models via Self-supervised Adaptation

Lin, Sihao; Liu, Daqi; Fu, Ruochong; Liu, Dongrui; Song, Andy; Xie, Hongwei; Li, Zhihui; Wang, Bing; Chang, Xiaojun

arXiv:2503.07125 (cs)

[Submitted on 10 Mar 2025]

Abstract: Estimating the 3D world from 2D monocular images is a fundamental yet challenging task due to the labour-intensive nature of 3D annotations. To simplify label acquisition, this work proposes a novel approach that bridges 2D vision foundation models (VFMs) with 3D tasks by decoupling 3D supervision into an ensemble of image-level primitives, e.g., semantic and geometric components. As a key motivation, we leverage the zero-shot capabilities of vision-language models to obtain image semantics. However, monocular depth estimation is notoriously ill-posed (multiple distinct 3D scenes can produce identical 2D projections), so directly inferring metric depth from a single image in a zero-shot manner is unsuitable. In contrast, 2D VFMs provide a promising source of relative depth, which in theory aligns with metric depth once properly scaled and offset. We therefore adapt the relative depth derived from VFMs into metric depth by optimising the scale and offset using temporal consistency, also known as novel view synthesis, without access to ground-truth metric depth. We then project the semantics into 3D space using the reconstructed metric depth, thereby providing 3D supervision. Extensive experiments on nuScenes and SemanticKITTI demonstrate the effectiveness of our framework. For instance, the proposed method surpasses the current state of the art by 3.34% mIoU on nuScenes for voxel occupancy prediction.
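
The geometric steps named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the pinhole intrinsics `K`, the 4x4 relative pose `T`, and the voxel grid parameters are all assumptions made for illustration. It shows the affine map from relative to metric depth, the reprojection into a neighbouring frame that a view-synthesis objective would compare against, and the lifting of per-pixel semantic labels into a voxel grid. The optimisation of the scale and offset itself is left out.

```python
import numpy as np

def to_metric(rel_depth, scale, offset):
    """Affine map from relative to metric depth: d = scale * r + offset."""
    return scale * rel_depth + offset

def backproject(depth, K):
    """Lift every pixel of an (H, W) depth map to camera-frame points, shape (H*W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T                 # unit-z rays through each pixel
    return rays * depth.reshape(-1, 1)

def reproject(points, K, T):
    """Project camera-frame points into a neighbouring view with relative pose T (4x4).

    A view-synthesis loss would sample the neighbouring image at these
    coordinates and compare it with the current image (assumed, not shown).
    """
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (pts_h @ T.T)[:, :3]
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def voxelize(points, labels, origin, voxel_size, grid_shape, empty=-1):
    """Write per-point semantic labels into a voxel grid (last write wins)."""
    grid = np.full(grid_shape, empty, dtype=int)
    idx = np.floor((points - origin) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    i, j, k = idx[inside].T
    grid[i, j, k] = labels.reshape(-1)[inside]
    return grid

# Toy usage with made-up values: a 3x3 image, constant relative depth.
K = np.array([[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]])
rel = np.full((3, 3), 0.5)
depth = to_metric(rel, scale=4.0, offset=1.0)       # metric depth of 3.0 everywhere
pts = backproject(depth, K)
uv = reproject(pts, K, np.eye(4))                   # identity pose: pixels map to themselves
labels = np.arange(9).reshape(3, 3)                 # stand-in for 2D semantic labels
grid = voxelize(pts, labels, origin=np.array([-2.25, -2.25, 0.25]),
                voxel_size=1.0, grid_shape=(4, 4, 4))
```

In this sketch only two scalars, `scale` and `offset`, would be tuned against the reprojection objective, which is what lets the relative depth be adapted without ground-truth metric depth.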

Comments:	preprint
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.07125 [cs.CV]
	(or arXiv:2503.07125v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.07125

Submission history

From: Sihao Lin
[v1] Mon, 10 Mar 2025 09:54:40 UTC (2,385 KB)
