Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange

Satpute, Ankit; Giessing, Noah; Greiner-Petter, Andre; Schubotz, Moritz; Teschke, Olaf; Aizawa, Akiko; Gipp, Bela

arXiv:2404.00344 (cs)

[Submitted on 30 Mar 2024]

Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performance that surpasses that of humans. Despite these advances, mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this study, we adopt a two-step approach to investigate the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answering benchmarks, to generate answers to 78 questions from Math Stack Exchange (MSE). Second, we conduct a case analysis of the best-performing LLM, manually evaluating the quality and accuracy of its answers. We find that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) among existing LLMs fine-tuned for answering mathematics questions and outperforms the current best approach on ArqMATH3 Task1 with respect to P@10. Our case analysis indicates that while GPT-4 can generate relevant responses in certain instances, it does not consistently answer all questions accurately. This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving. Through case analysis, we shed light on the gaps in LLM capabilities within mathematics, thereby setting the stage for future research and advancements in AI-driven mathematical reasoning. We make our code and findings publicly available for research.
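
For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how P@k and nDCG are commonly computed for a single question. It is not the authors' evaluation code. It assumes each answer in a ranked list carries a graded relevance label (0 = not relevant, higher = more relevant); the function names, the `threshold` parameter, and the example labels are hypothetical and chosen only for illustration.

```python
import math


def precision_at_k(relevances, k=10, threshold=1):
    """Fraction of the top-k ranked items whose label is >= threshold.

    `relevances` lists graded labels in ranked order (best-ranked first).
    Positions beyond the end of the list count as non-relevant.
    """
    top_k = relevances[:k]
    hits = sum(1 for label in top_k if label >= threshold)
    return hits / k


def dcg(relevances):
    """Discounted cumulative gain with the usual log2(rank + 1) discount."""
    return sum(
        label / math.log2(position + 2)  # position 0 is rank 1
        for position, label in enumerate(relevances)
    )


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal reordering.

    Assumption: the ideal ranking is built from the same judged labels,
    sorted from most to least relevant. Returns 0.0 if nothing is relevant.
    """
    ideal_dcg = dcg(sorted(relevances, reverse=True))
    if ideal_dcg == 0:
        return 0.0
    return dcg(relevances) / ideal_dcg


if __name__ == "__main__":
    # Hypothetical labels for one question: the first and third ranked
    # answers are relevant, the second is not.
    ranked_labels = [3, 0, 2]
    print("P@3 :", precision_at_k(ranked_labels, k=3, threshold=2))
    print("nDCG:", ndcg(ranked_labels))
```

Scores reported over a question set, such as the 78 MSE questions above, are typically the mean of these per-question values; whether the paper binarizes relevance at a particular threshold for P@10 is not stated in the abstract, so the `threshold` default here is only a placeholder.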

Comments:	Accepted for publication at the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), July 14-18, 2024, Washington D.C., USA
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Cite as:	arXiv:2404.00344 [cs.CL]
	(or arXiv:2404.00344v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2404.00344

Submission history

From: Bela Gipp
[v1] Sat, 30 Mar 2024 12:48:31 UTC (44 KB)
