EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents

Cheng, Zhili; Tu, Yuge; Li, Ran; Dai, Shiqi; Hu, Jinyi; Hu, Shengding; Li, Jiahao; Shi, Yang; Yu, Tianyu; Chen, Weize; Shi, Lei; Sun, Maosong

arXiv:2501.11858 (cs)

[Submitted on 21 Jan 2025]

Abstract: Multimodal Large Language Models (MLLMs) have advanced significantly, making them a promising foundation for embodied agents. Existing MLLM benchmarks primarily use static images or videos, which limits assessment to non-interactive scenarios. Meanwhile, existing embodied AI benchmarks are task-specific and insufficiently diverse, so they do not adequately evaluate the embodied capabilities of MLLMs. To address this, we propose EmbodiedEval, a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks. EmbodiedEval features 328 distinct tasks within 125 varied 3D scenes, each of which is rigorously selected and annotated. It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity, all within a unified simulation and evaluation framework tailored for MLLMs. The tasks are organized into five categories (navigation, object interaction, social interaction, attribute question answering, and spatial question answering) to assess different capabilities of the agents. We evaluated state-of-the-art MLLMs on EmbodiedEval and found that they fall significantly short of human-level performance on embodied tasks. Our analysis demonstrates the limitations of existing MLLMs in embodied capabilities, providing insights for their future development. We open-source all evaluation data and the simulation framework at this https URL.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2501.11858 [cs.CV]
	(or arXiv:2501.11858v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.11858

Submission history

From: Zhili Cheng
[v1] Tue, 21 Jan 2025 03:22:10 UTC (15,985 KB)
