Towards Label-free Scene Understanding by Vision Foundation Models

Chen, Runnan; Liu, Youquan; Kong, Lingdong; Chen, Nenglun; Zhu, Xinge; Ma, Yuexin; Liu, Tongliang; Wang, Wenping

arXiv:2306.03899 (cs)

[Submitted on 6 Jun 2023 (v1), last revised 30 Oct 2023 (this version, v2)]

Abstract: Vision foundation models such as Contrastive Language-Image Pre-training (CLIP) and Segment Anything (SAM) have demonstrated impressive zero-shot performance on image classification and segmentation tasks. However, combining CLIP and SAM for label-free scene understanding has yet to be explored. In this paper, we investigate the potential of vision foundation models to enable networks to comprehend 2D and 3D worlds without labelled data. The primary challenge lies in effectively supervising networks under extremely noisy pseudo labels, which are generated by CLIP and whose noise is further exacerbated during propagation from the 2D to the 3D domain. To tackle this challenge, we propose a novel Cross-modality Noisy Supervision (CNS) method that leverages the strengths of CLIP and SAM to supervise 2D and 3D networks simultaneously. In particular, we introduce a prediction consistency regularization to co-train the 2D and 3D networks, then further enforce consistency of the networks' latent spaces using SAM's robust feature representation. Experiments conducted on diverse indoor and outdoor datasets demonstrate the superior performance of our method in understanding 2D and 3D open environments. Our 2D and 3D networks achieve label-free semantic segmentation with 28.4% and 33.5% mIoU on ScanNet, improvements of 4.7% and 7.9%, respectively. On the nuImages and nuScenes datasets, the performance is 22.1% and 26.8%, with improvements of 3.5% and 6.0%, respectively. Code is available (this https URL).
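
The results above are reported in mIoU (mean intersection-over-union). As a reading aid, the following is a minimal sketch of how that metric is commonly computed from per-pixel or per-point class indices. It is not taken from the paper's code: the function name `mean_iou`, its arguments, and the choice to skip classes absent from both prediction and ground truth are assumptions for illustration, and individual benchmarks may treat ignored labels or absent classes differently.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean intersection-over-union across classes (hypothetical helper).

    Assumption: classes that appear in neither pred nor target are skipped
    rather than counted as IoU 0 or 1.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes   # elements where pred == target == c
    pred_count = [0] * num_classes     # elements predicted as c
    target_count = [0] * num_classes   # elements labelled as c

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Class 0: IoU = 1/2, class 1: IoU = 2/3, so the mean is 7/12.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Under this definition, a score such as 28.4% means that, averaged over classes, the predicted and ground-truth regions overlap in roughly 28% of their union.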

Comments:	NeurIPS 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2306.03899 [cs.CV]
	(or arXiv:2306.03899v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.03899

Submission history

From: Dr. Runnan Chen
[v1] Tue, 6 Jun 2023 17:57:49 UTC (49,504 KB)
[v2] Mon, 30 Oct 2023 15:27:57 UTC (8,394 KB)
