Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers

Huang, Haifeng; Chen, Yilun; Wang, Zehan; Huang, Rongjie; Xu, Runsen; Wang, Tai; Liu, Luping; Cheng, Xize; Zhao, Yang; Pang, Jiangmiao; Zhao, Zhou

arXiv:2312.08168 (cs)

[Submitted on 13 Dec 2023 (v1), last revised 28 Sep 2024 (this version, v4)]

Abstract: Recent advancements in 3D Large Language Models (LLMs) have demonstrated promising capabilities for 3D scene understanding. However, previous methods exhibit deficiencies in general referencing and grounding capabilities for intricate scene comprehension. In this paper, we introduce the use of object identifiers and object-centric representations to interact with scenes at the object level. Specifically, we decompose the input 3D scene into a set of object proposals, each assigned a unique identifier token, which enables efficient object referencing and grounding during user-assistant interactions. Given the scarcity of scene-language data, we model the scene embeddings as a sequence of explicit object-level embeddings, derived from semantic-rich 2D or 3D representations. By employing object identifiers, we transform diverse 3D scene-language tasks into a unified question-answering format, facilitating joint training without the need for additional task-specific heads. With minimal fine-tuning on all downstream tasks, our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
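
To make the identifier idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It shows how object proposals might each be paired with an identifier token, and how a grounding task and a captioning task could then share one question-answering format. The `<OBJ000>` token spelling, the `<feat>` placeholder (standing in for an object-level embedding), the prompt wording, and every class and function name are assumptions made for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class ObjectProposal:
    """One object proposal in a scene (hypothetical container for this sketch)."""
    index: int  # position of the proposal in the scene's proposal list
    label: str  # illustrative only; not used when building prompts


def identifier_token(index: int) -> str:
    """Assumed identifier spelling: <OBJ000>, <OBJ001>, ..."""
    return f"<OBJ{index:03d}>"


def scene_prefix(proposals: list[ObjectProposal]) -> str:
    """Pair each identifier with a <feat> placeholder, where a real model
    would splice in that object's embedding."""
    return " ".join(f"{identifier_token(p.index)} <feat>" for p in proposals)


def grounding_sample(proposals: list[ObjectProposal], description: str,
                     target_index: int) -> dict[str, str]:
    """Grounding as QA: the answer is simply the target's identifier."""
    question = f"Which object matches the description: {description}?"
    return {
        "prompt": f"{scene_prefix(proposals)}\nQuestion: {question}",
        "answer": identifier_token(target_index),
    }


def captioning_sample(proposals: list[ObjectProposal], target_index: int,
                      caption: str) -> dict[str, str]:
    """Captioning as QA: the question references the object by identifier."""
    question = f"Describe {identifier_token(target_index)}."
    return {
        "prompt": f"{scene_prefix(proposals)}\nQuestion: {question}",
        "answer": caption,
    }


def parse_identifiers(text: str) -> list[int]:
    """Recover proposal indices from identifier tokens in generated text."""
    return [int(m) for m in re.findall(r"<OBJ(\d{3})>", text)]


if __name__ == "__main__":
    scene = [ObjectProposal(0, "chair"), ObjectProposal(1, "table")]
    print(grounding_sample(scene, "the chair next to the table", 0))
    print(parse_identifiers("It is <OBJ001>, beside <OBJ000>."))
```

Because both sample builders return the same prompt/answer structure, a single language-model objective could in principle cover both tasks, which is the property the abstract attributes to identifier-based formatting.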

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2312.08168 [cs.CV]
	(or arXiv:2312.08168v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2312.08168

Submission history

From: Haifeng Huang
[v1] Wed, 13 Dec 2023 14:27:45 UTC (4,084 KB)
[v2] Fri, 15 Dec 2023 06:15:33 UTC (4,089 KB)
[v3] Thu, 26 Sep 2024 16:51:37 UTC (5,081 KB)
[v4] Sat, 28 Sep 2024 03:56:28 UTC (5,577 KB)
