PaperBench: Evaluating AI's Ability to Replicate AI Research

Starace, Giulio; Jaffe, Oliver; Sherburn, Dane; Aung, James; Chan, Jun Shern; Maksin, Leon; Dias, Rachel; Mays, Evan; Kinsella, Benjamin; Thompson, Wyatt; Heidecke, Johannes; Glaese, Amelia; Patwardhan, Tejal

arXiv:2504.01848 (cs)

[Submitted on 2 Apr 2025 (v1), last revised 7 Apr 2025 (this version, v3)]

Abstract: We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (this https URL) to facilitate future research in understanding the AI engineering capabilities of AI agents.
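The hierarchical rubric described in the abstract can be illustrated with a minimal sketch. It assumes, beyond what the abstract states, that each leaf task is graded pass/fail and that a parent's score is the weighted average of its children's scores. The class name RubricNode, the weights, and the example requirements are hypothetical and are not taken from the PaperBench codebase.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node of a hypothetical rubric tree (illustrative only)."""
    requirement: str
    weight: float = 1.0                # relative importance among siblings (assumed)
    passed: Optional[bool] = None      # only meaningful on leaf nodes
    children: List["RubricNode"] = field(default_factory=list)


def score(node: RubricNode) -> float:
    """Return a score in [0, 1] for the subtree rooted at `node`."""
    if not node.children:
        # Leaf: an individually gradable task, assumed to be pass/fail.
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    if total_weight == 0:
        return 0.0
    # Internal node: weighted average of the children's scores.
    weighted = sum(child.weight * score(child) for child in node.children)
    return weighted / total_weight


def count_leaves(node: RubricNode) -> int:
    """Count the individually gradable tasks under `node`."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)


# A toy rubric for one paper; the requirements are made up for illustration.
rubric = RubricNode("Replicate the paper", children=[
    RubricNode("Implement the method", weight=2.0, children=[
        RubricNode("Model is defined", passed=True),
        RubricNode("Training loop runs", passed=False),
    ]),
    RubricNode("Reproduce the main result", weight=1.0, passed=False),
])

if __name__ == "__main__":
    print(f"gradable tasks: {count_leaves(rubric)}")
    print(f"replication score: {score(rubric):.1%}")
```

Under these assumptions, a benchmark-level figure such as the average replication score would be the mean of the per-paper root scores, with the pass/fail values on the leaves supplied by a grader such as the LLM-based judge.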

Comments:	30 pages, 14 figures
Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2504.01848 [cs.AI]
	(or arXiv:2504.01848v3 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2504.01848

Submission history

From: James Aung
[v1] Wed, 2 Apr 2025 15:55:24 UTC (4,982 KB)
[v2] Fri, 4 Apr 2025 12:44:57 UTC (4,982 KB)
[v3] Mon, 7 Apr 2025 12:15:49 UTC (4,982 KB)
