ProcessBench: Identifying Process Errors in Mathematical Reasoning

Zheng, Chujie; Zhang, Zhenru; Zhang, Beichen; Lin, Runji; Lu, Keming; Yu, Bowen; Liu, Dayiheng; Zhou, Jingren; Lin, Junyang

arXiv:2412.06559 (cs)

[Submitted on 9 Dec 2024 (v1), last revised 10 Dec 2024 (this version, v2)]

Abstract: As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly important for their scalable oversight. In this paper, we introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. It consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. Each test case contains a step-by-step solution with the error location annotated by human experts. Models are required to identify the earliest step that contains an error, or conclude that all steps are correct. We conduct an extensive evaluation on ProcessBench involving two types of models: process reward models (PRMs) and critic models, where for the latter we prompt general language models to critique each solution step by step. We draw two main observations: (1) Existing PRMs typically fail to generalize to more challenging math problems beyond GSM8K and MATH. They underperform both critic models (i.e., prompted general language models) and our own trained PRM that is straightforwardly fine-tuned on the PRM800K dataset. (2) The best open-source model, QwQ-32B-Preview, demonstrates critique capability competitive with the proprietary model GPT-4o, although it still lags behind the reasoning-specialized o1-mini. We hope ProcessBench can foster future research in reasoning process assessment, paving the way toward scalable oversight of language models.
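The task format described in the abstract (report the earliest erroneous step, or conclude that all steps are correct) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and does not come from the paper: the `-1` sentinel for "all steps correct", the function names, the idea that a PRM emits one score in [0, 1] per step, the 0.5 threshold, and the toy data. Plain exact-match accuracy is used only to keep the example short; the abstract does not state which metric the benchmark reports.

```python
from typing import Sequence

# Assumed sentinel meaning "no step contains an error" (hypothetical convention).
ALL_CORRECT = -1


def earliest_error_from_scores(step_scores: Sequence[float],
                               threshold: float = 0.5) -> int:
    """Turn per-step correctness scores into one prediction.

    Returns the 0-based index of the first step whose score falls below
    `threshold`, or ALL_CORRECT if every step passes.
    """
    for index, score in enumerate(step_scores):
        if score < threshold:
            return index
    return ALL_CORRECT


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of test cases where the predicted index matches the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    matches = sum(p == l for p, l in zip(predictions, labels))
    return matches / len(labels)


if __name__ == "__main__":
    # Toy data: annotated earliest-error indices and made-up per-step scores.
    labels = [2, ALL_CORRECT, 0]
    scores = [
        [0.9, 0.8, 0.2, 0.1],  # first low score at step 2
        [0.9, 0.95],           # every step passes
        [0.7, 0.3],            # first low score at step 1 (label says step 0)
    ]
    predictions = [earliest_error_from_scores(s) for s in scores]
    print(predictions)
    print(accuracy(predictions, labels))
```

A critic model would fit the same interface by having its free-text critique parsed into a single step index, so both model types can be compared on identical labels.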

Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2412.06559 [cs.AI]
	(or arXiv:2412.06559v2 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2412.06559

Submission history

From: Chujie Zheng
[v1] Mon, 9 Dec 2024 15:11:40 UTC (437 KB)
[v2] Tue, 10 Dec 2024 08:10:32 UTC (437 KB)
