Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents

Sun, Haochen; Zhang, Shuwen; Ren, Lei; Xu, Hao; Fu, Hao; Yuan, Caixia; Wang, Xiaojie

arXiv:2502.20073 (cs)

[Submitted on 27 Feb 2025]

Abstract: Large language model (LLM)-based agent systems have made great strides in real-world applications beyond traditional NLP tasks. This paper proposes a new LLM-powered Multi-Agent System (LLM-MAS) benchmark, Collab-Overcooked, built on the popular Overcooked-AI game with more applicable and challenging tasks in interactive environments. Collab-Overcooked extends existing benchmarks from two novel perspectives. First, it provides a multi-agent framework that supports diverse tasks and objectives and encourages collaboration through natural language communication. Second, it introduces a spectrum of process-oriented evaluation metrics to assess the fine-grained collaboration capabilities of different LLM agents, a dimension often overlooked in prior work. We conduct extensive experiments on 10 popular LLMs and show that, while the LLMs demonstrate a strong ability in goal interpretation, they differ significantly in active collaboration and continuous adaptation, both of which are critical for efficiently completing complicated tasks. Notably, we highlight the strengths and weaknesses of LLM-MAS and provide insights for improving and evaluating LLM-MAS on a unified, open-source benchmark. Environments, 30 open-ended tasks, and an integrated evaluation package are now publicly available at this https URL.

Comments:	25 pages, 14 figures
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Cite as:	arXiv:2502.20073 [cs.CL]
	(or arXiv:2502.20073v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.20073

Submission history

From: Haochen Sun
[v1] Thu, 27 Feb 2025 13:31:13 UTC (6,742 KB)
