DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving

Guo, Xianda; Zhang, Ruijun; Duan, Yiqun; He, Yuhang; Zhang, Chenming; Liu, Shuai; Chen, Long

arXiv:2411.13112 (cs)

[Submitted on 20 Nov 2024 (v1), last revised 26 Nov 2024 (this version, v2)]

Abstract: Autonomous driving requires a comprehensive understanding of 3D environments to facilitate high-level tasks such as motion prediction, planning, and mapping. In this paper, we introduce DriveMLLM, a benchmark specifically designed to evaluate the spatial understanding capabilities of multimodal large language models (MLLMs) in autonomous driving. DriveMLLM includes 880 front-facing camera images and introduces both absolute and relative spatial reasoning tasks, accompanied by linguistically diverse natural language questions. To measure MLLMs' performance, we propose novel evaluation metrics focusing on spatial understanding. We evaluate several state-of-the-art MLLMs on DriveMLLM, and our results reveal the limitations of current models in understanding complex spatial relationships in driving contexts. We believe these findings underscore the need for more advanced MLLM-based spatial reasoning methods and highlight the potential for DriveMLLM to drive further research in autonomous driving. Code will be available at \url{this https URL}.

Comments:	Code will be available at \url{this https URL}
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2411.13112 [cs.CV]
	(or arXiv:2411.13112v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2411.13112

Submission history

From: Xianda Guo
[v1] Wed, 20 Nov 2024 08:14:01 UTC (2,834 KB)
[v2] Tue, 26 Nov 2024 07:24:04 UTC (2,833 KB)
