More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding

Tang, Yuan; Han, Xu; Li, Xianzhi; Yu, Qiao; Xu, Jinfeng; Hao, Yixue; Hu, Long; Chen, Min

arXiv:2408.15966 (cs)

[Submitted on 28 Aug 2024 (v1), last revised 5 Sep 2024 (this version, v2)]

Title: More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding

Authors: Yuan Tang, Xu Han, Xianzhi Li, Qiao Yu, Jinfeng Xu, Yixue Hao, Long Hu, Min Chen

Abstract: Enabling Large Language Models (LLMs) to comprehend the 3D physical world remains a significant challenge. Due to the lack of large-scale 3D-text pair datasets, the success of LLMs has yet to be replicated in 3D understanding. In this paper, we rethink this issue and propose a new task: 3D Data-Efficient Point-Language Understanding. The goal is to enable LLMs to achieve robust 3D object understanding with minimal 3D point cloud and text data pairs. To address this task, we introduce GreenPLM, which leverages more text data to compensate for the lack of 3D data. First, inspired by using CLIP to align images and text, we utilize a pre-trained point cloud-text encoder to map the 3D point cloud space to the text space. This mapping allows us to seamlessly connect the text space with LLMs. Once the point-text-LLM connection is established, we further enhance text-LLM alignment by expanding the intermediate text space, thereby reducing the reliance on 3D point cloud data. Specifically, we generate 6M free-text descriptions of 3D objects, and design a three-stage training strategy to help LLMs better explore the intrinsic connections between different modalities. To achieve efficient modality alignment, we design a zero-parameter cross-attention module for token pooling. Extensive experimental results show that GreenPLM requires only 12% of the 3D training data used by existing state-of-the-art models to achieve superior 3D understanding. Remarkably, GreenPLM also achieves competitive performance using text-only data. The code and weights are available at: this https URL.
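
To make the idea of zero-parameter cross-attention for token pooling concrete, here is a minimal sketch in plain Python. It is not the GreenPLM implementation: the abstract does not specify the module's design, so the function name `parameter_free_cross_attention_pool`, the use of scaled dot-product scores, and the choice of query vectors are all assumptions made for illustration. The only property carried over from the abstract is that pooling happens through cross-attention with no learned weights.

```python
import math


def parameter_free_cross_attention_pool(queries, tokens):
    """Pool `tokens` into len(queries) vectors via attention with no learned weights.

    Hypothetical helper (not from the paper): queries and tokens are lists of
    equal-length float vectors. Each query attends over all tokens using
    scaled dot-product scores, and the output is the attention-weighted
    average of the raw token vectors (no Q/K/V projection matrices).
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for q in queries:
        # Similarity between this query and every token.
        scores = [scale * sum(qi * ti for qi, ti in zip(q, t)) for t in tokens]
        # Numerically stable softmax over the token axis.
        max_score = max(scores)
        exps = [math.exp(s - max_score) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the token vectors themselves.
        pooled.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return pooled


if __name__ == "__main__":
    # Two toy "patch tokens" pooled into one vector by a single query.
    toy_tokens = [[1.0, 0.0], [0.0, 1.0]]
    toy_queries = [[1.0, 0.0]]
    print(parameter_free_cross_attention_pool(toy_queries, toy_tokens))
```

Because the sketch has no trainable parameters, the number of output tokens is set entirely by how many query vectors are supplied; a zero query reduces to plain mean pooling over the tokens.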

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2408.15966 [cs.CV]
	(or arXiv:2408.15966v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.15966

Submission history

From: Yuan Tang
[v1] Wed, 28 Aug 2024 17:38:44 UTC (10,684 KB)
[v2] Thu, 5 Sep 2024 06:33:31 UTC (10,684 KB)
