Situational Awareness Matters in 3D Vision Language Reasoning

Man, Yunze; Gui, Liang-Yan; Wang, Yu-Xiong

arXiv:2406.07544 (cs)

[Submitted on 11 Jun 2024 (v1), last revised 26 Jun 2024 (this version, v2)]

Abstract: Being able to carry out complicated vision language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. In this work, we demonstrate that a critical and distinct challenge in 3D vision language reasoning is situational awareness, which incorporates two key components: (1) The autonomous agent grounds its self-location based on a language prompt. (2) The agent answers open-ended questions from the perspective of its calculated position. To address this challenge, we introduce SIG3D, an end-to-end Situation-Grounded model for 3D vision language reasoning. We tokenize the 3D scene into sparse voxel representation and propose a language-grounded situation estimator, followed by a situated question answering module. Experiments on the SQA3D and ScanQA datasets show that SIG3D outperforms state-of-the-art models in situation estimation and question answering by a large margin (e.g., an enhancement of over 30% on situation estimation accuracy). Subsequent analysis corroborates our architectural design choices, explores the distinct functions of visual and textual tokens, and highlights the importance of situational awareness in the domain of 3D question answering.

Comments:	CVPR 2024. Project Page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2406.07544 [cs.CV]
	(or arXiv:2406.07544v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.07544

Submission history

From: Yunze Man
[v1] Tue, 11 Jun 2024 17:59:45 UTC (2,946 KB)
[v2] Wed, 26 Jun 2024 17:59:50 UTC (2,946 KB)
