3D Vision and Language Pretraining with Large-Scale Synthetic Data

Yang, Dejie; Xu, Zhu; Mo, Wentao; Chen, Qingchao; Huang, Siyuan; Liu, Yang

arXiv:2407.06084 (cs)

[Submitted on 8 Jul 2024]

Abstract: 3D Vision-Language Pre-training (3D-VLP) aims to provide a pre-trained model that bridges 3D scenes with natural language, an important capability for embodied intelligence. However, current 3D-VLP datasets are hindered by limited scene-level diversity and insufficient fine-grained annotations (only 1.2K scenes and 280K textual annotations in ScanScribe), primarily because collecting and annotating 3D scenes is labor-intensive. To overcome these obstacles, we construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at the object, view, and room levels. It offers diverse scene data, rich textual descriptions, multi-grained 3D-text associations, and low collection cost. Using the rich annotations in SynVL3D, we pre-train a simple, unified Transformer to align 3D and language with multi-grained pre-training tasks. Moreover, we propose synthetic-to-real domain adaptation in the downstream fine-tuning process to address the domain shift. Through extensive experiments, we verify the effectiveness of our model design by achieving state-of-the-art performance on downstream tasks including visual grounding, dense captioning, and question answering.

Comments:	Accepted by IJCAI 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.06084 [cs.CV]
	(or arXiv:2407.06084v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.06084

Submission history

From: Dejie Yang
[v1] Mon, 8 Jul 2024 16:26:52 UTC (434 KB)
