Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation

Park, Jaewoo; Park, Jungyang; Jang, Dongju; Chung, Jiwan; Yoo, Byungwoo; Shin, Jaewoo; Park, Seonjoon; Kim, Taehyeong; Yu, Youngjae

arXiv:2504.03197v1 (cs)

[Submitted on 4 Apr 2025 (this version), latest version 7 Apr 2025 (v2)]

Abstract: With the rapid advancement of mathematical reasoning capabilities in large language models (LLMs), AI systems are increasingly being adopted in educational settings to support students' comprehension of problem-solving processes. However, a critical component remains underexplored in current LLM-generated explanations: visual explanation. In real-world instructional contexts, human tutors routinely employ visual aids, such as diagrams, markings, and highlights, to enhance conceptual clarity. To bridge this gap, we introduce a novel task of visual solution explanation, which requires not only solving problems but also generating explanations that incorporate newly introduced visual elements essential for understanding (e.g., auxiliary lines, annotations, or geometric constructions). To evaluate model performance on this task, we propose MathExplain, a multimodal benchmark consisting of 997 math problems annotated with visual keypoints and corresponding explanatory text that references those elements. Our empirical results show that while some closed-source models demonstrate promising capabilities on visual solution-explaining, current open-source general-purpose models perform inconsistently, particularly in identifying relevant visual components and producing coherent keypoint-based explanations. We expect that visual solution-explaining and the MathExplain dataset will catalyze further research on multimodal LLMs in education and advance their deployment as effective, explanation-oriented AI tutors. Code and data will be released publicly.

Comments:	18 pages, 4 figures
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2504.03197 [cs.CL]
	(or arXiv:2504.03197v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2504.03197

Submission history

From: Jaewoo Park
[v1] Fri, 4 Apr 2025 06:03:13 UTC (3,623 KB)
[v2] Mon, 7 Apr 2025 14:23:25 UTC (3,624 KB)
