Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness

Wang, Haochen; Zhao, Yucheng; Wang, Tiancai; Fan, Haoqiang; Zhang, Xiangyu; Zhang, Zhaoxiang

arXiv:2504.01901 (cs)

[Submitted on 2 Apr 2025]

Abstract: The rapid development of Large Multimodal Models (LMMs) for 2D images and videos has spurred efforts to adapt these models for interpreting 3D scenes. However, the absence of large-scale 3D vision-language datasets has posed a significant obstacle. To address this issue, typical approaches focus on injecting 3D awareness into 2D LMMs by designing 3D input-level scene representations. This work provides a new perspective. We introduce reconstructive visual instruction tuning with 3D-awareness (Ross3D), which integrates 3D-aware visual supervision into the training procedure. Specifically, it incorporates cross-view and global-view reconstruction. The former requires reconstructing masked views by aggregating overlapping information from other views. The latter aims to aggregate information from all available views to recover Bird's-Eye-View images, contributing to a comprehensive overview of the entire scene. Empirically, Ross3D achieves state-of-the-art performance across various 3D scene understanding benchmarks. More importantly, our semi-supervised experiments demonstrate significant potential in leveraging large amounts of unlabeled 3D vision-only data.
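
The cross-view objective described in the abstract (hide some views, then recover them from the views that remain) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not state the masking ratio, the form of the mask token, or the loss, so `mask_views`, `apply_mask`, `masked_reconstruction_loss`, the 0.25 ratio, the zero mask token, and the mean-squared-error loss are all hypothetical choices made for illustration. The "model" here is a placeholder that predicts the mean of the visible views.

```python
import random


def mask_views(num_views, mask_ratio, rng):
    """Choose view indices to hide, always leaving at least one view visible.

    Hypothetical helper: the ratio and sampling scheme are assumptions.
    """
    if num_views < 2:
        raise ValueError("need at least two views to mask one and keep one")
    num_masked = min(num_views - 1, max(1, round(num_views * mask_ratio)))
    return sorted(rng.sample(range(num_views), num_masked))


def apply_mask(view_features, masked_ids, mask_token):
    """Replace the features of every masked view with a copy of the mask token."""
    hidden = set(masked_ids)
    return [
        list(mask_token) if i in hidden else list(feat)
        for i, feat in enumerate(view_features)
    ]


def masked_reconstruction_loss(predicted, target, masked_ids):
    """Mean squared error computed over the masked views only (assumed loss)."""
    total, count = 0.0, 0
    for i in masked_ids:
        for p, t in zip(predicted[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


rng = random.Random(0)
# Six toy "views", each a 4-dimensional feature vector.
views = [[float(v + d) for d in range(4)] for v in range(6)]

masked_ids = mask_views(len(views), mask_ratio=0.25, rng=rng)
model_input = apply_mask(views, masked_ids, mask_token=[0.0] * 4)

# Placeholder for a real model: predict each masked view as the mean of
# the visible views, i.e. information aggregated from the other views.
visible = [feat for i, feat in enumerate(model_input) if i not in masked_ids]
mean_visible = [sum(col) / len(visible) for col in zip(*visible)]
predicted = [
    mean_visible if i in masked_ids else feat
    for i, feat in enumerate(model_input)
]

loss = masked_reconstruction_loss(predicted, views, masked_ids)
print(f"masked views: {masked_ids}, reconstruction loss: {loss:.3f}")
```

Restricting the loss to the masked views means the placeholder is scored only on content it never saw, which is the part of the objective that forces information to be pulled from the other views.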

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Cite as:	arXiv:2504.01901 [cs.CV]
	(or arXiv:2504.01901v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.01901

Submission history

From: Haochen Wang
[v1] Wed, 2 Apr 2025 16:59:55 UTC (921 KB)
