Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models

Sun, Haoxiang; Min, Yingqian; Chen, Zhipeng; Zhao, Wayne Xin; Liu, Zheng; Wang, Zhongyuan; Fang, Lei; Wen, Ji-Rong

arXiv:2503.21380 (cs)

[Submitted on 27 Mar 2025]

Abstract: In recent years, the rapid development of large reasoning models has resulted in the saturation of existing benchmarks for evaluating mathematical reasoning, highlighting the urgent need for more challenging and rigorous evaluation frameworks. To address this gap, we introduce OlymMATH, a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions. The problems are systematically organized into two distinct difficulty tiers: (1) AIME-level problems (easy) that establish a baseline for mathematical reasoning assessment, and (2) significantly more challenging problems (hard) designed to push the boundaries of current state-of-the-art models. The problems span four core mathematical fields, and each includes a verifiable numerical solution to enable objective, rule-based evaluation. Empirical results underscore the significant challenge presented by OlymMATH, with state-of-the-art models including DeepSeek-R1 and OpenAI's o3-mini demonstrating notably limited accuracy on the hard subset. Furthermore, the benchmark facilitates comprehensive bilingual assessment of mathematical reasoning abilities, a critical dimension that remains largely unaddressed in mainstream mathematical reasoning benchmarks. We release the OlymMATH benchmark at the STILL project.
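
The abstract states that each problem carries a verifiable numerical solution so that evaluation can be rule-based, but it does not describe the checking procedure. The following is a minimal sketch of what such a check might look like, not the authors' implementation. It assumes, purely for illustration, that the model writes its final answer inside a LaTeX `\boxed{...}` and that the reference answer is a plain number or fraction such as `3/4`; the function names `extract_boxed`, `to_number`, and `is_correct` are hypothetical.

```python
from fractions import Fraction


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
        i += 1
    return None  # braces never closed


def to_number(s):
    """Parse an integer, decimal, or a/b string exactly; None if not numeric."""
    s = s.replace(",", "").replace(" ", "")
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def is_correct(model_output, reference):
    """Rule-based check: compare the boxed answer with the reference."""
    answer = extract_boxed(model_output)
    if answer is None:
        return False
    a, b = to_number(answer), to_number(reference)
    if a is not None and b is not None:
        return a == b  # exact rational comparison, so 0.75 equals 3/4
    # Assumption: fall back to exact string match for non-numeric forms.
    return answer == reference.strip()
```

Exact rational comparison avoids floating-point tolerance choices for answers that are integers or simple fractions. Answers written in richer forms (radicals, `\frac`, intervals) would need a symbolic comparison step that this sketch leaves out.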

Comments:	Technical Report on Slow Thinking with LLMs: Evaluation Benchmark
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2503.21380 [cs.CL]
	(or arXiv:2503.21380v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2503.21380

Submission history

From: Haoxiang Sun
[v1] Thu, 27 Mar 2025 11:20:17 UTC (81 KB)
