Scaling Flaws of Verifier-Guided Search in Mathematical Reasoning

Yu, Fei; Li, Yingru; Wang, Benyou

Abstract: Large language models (LLMs) struggle with multi-step reasoning, and inference-time scaling has emerged as a promising strategy for improving their performance. When the sample size is limited, verifier-guided search outperforms repeated sampling by selecting and prioritizing valid reasoning paths. However, we identify a critical limitation: scaling flaws, which are prevalent across different models (Mistral 7B and DeepSeekMath 7B), benchmarks (GSM8K and MATH), and verifiers (outcome value models and process reward models). As the sample size increases, the advantage of verifier-guided search diminishes, and it eventually underperforms repeated sampling. Our analysis attributes this to verifier failures, in which imperfect verifiers misrank candidates and erroneously prune all valid paths. These failures are further exacerbated on challenging and out-of-distribution problems, which restricts the effectiveness of search. To mitigate verifier failures, we explore reducing reliance on verifiers and conduct preliminary investigations using two simple methods. Our findings reveal fundamental limitations in verifier-guided search and suggest future directions.
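The verifier failure described in the abstract, where an imperfect verifier misranks candidates and prunes every valid path, can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' method: the verifier is modeled as a true validity signal plus Gaussian noise, pruning keeps the top-scoring candidates in a single step, and all names (`prune`, `noisy_score`, `all_valid_pruned_rate`) and parameters are hypothetical.

```python
import random


def prune(candidates, scores, beam_width):
    """Keep the beam_width candidates with the highest verifier scores."""
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in ranked[:beam_width]]


def noisy_score(is_valid, noise, rng):
    """Assumed verifier model: 1.0 for a valid path, 0.0 otherwise, plus Gaussian noise."""
    return (1.0 if is_valid else 0.0) + rng.gauss(0.0, noise)


def all_valid_pruned_rate(n_candidates, n_valid, beam_width, noise, trials=2000, seed=0):
    """Estimate how often a single pruning step discards every valid candidate."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        # Each candidate is represented only by its validity flag (True = valid path).
        candidates = [i < n_valid for i in range(n_candidates)]
        scores = [noisy_score(valid, noise, rng) for valid in candidates]
        kept = prune(candidates, scores, beam_width)
        if not any(kept):
            failures += 1
    return failures / trials


if __name__ == "__main__":
    # Hypothetical settings, chosen only for illustration.
    for noise in (0.0, 0.5, 1.0, 2.0):
        rate = all_valid_pruned_rate(n_candidates=16, n_valid=2, beam_width=2, noise=noise)
        print(f"noise={noise}: all valid paths pruned in {rate:.1%} of trials")
```

With zero noise the verifier ranks perfectly and a valid path always survives. Raising `noise` in this toy model lets invalid candidates outrank valid ones, so a pruning step can eliminate every valid path before the search finishes.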

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2502.00271 [cs.CL]
	(or arXiv:2502.00271v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.00271
