ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian

Syromiatnikov, Mykyta; Ruvinskaya, Victoria; Troynina, Anastasiya

doi:10.15276/ict.01.2024.27

arXiv:2501.06715 (cs)

[Submitted on 12 Jan 2025]

Abstract: As large language models are increasingly used for problems beyond simple text understanding or generation, assessing their abilities and limitations becomes crucial. While significant progress has been made in this area over the last few years, most research has focused on benchmarking in English, leaving other languages underexplored. This makes evaluating the reasoning and robustness of language models in Ukrainian particularly challenging. The purpose of this work is to establish a comprehensive benchmark for evaluating the reasoning capabilities of large language models in the Ukrainian language. This paper presents the ZNO-Eval benchmark, based on real exam tasks from Ukraine's standardized educational testing system: the External Independent Evaluation and the National Multi-subject Test. With single-answer, multiple-choice, matching, and open-ended questions from diverse subjects, including Ukrainian language, mathematics, history, and geography, this dataset paves the way toward a thorough analysis of reasoning capabilities across different domains and complexities. Evaluation of several well-known language models, such as GPT-3.5-Turbo, GPT-4o, GPT-4-Turbo, Mistral Large, Claude 3 Opus, and Gemini-1.5 Pro, on this benchmark demonstrated the superiority of GPT-4o in both common knowledge reasoning and intricate language tasks. At the same time, Gemini-1.5 Pro and GPT-4-Turbo excelled in the arithmetic domain, leading in single-answer and open-ended math problems. While all models were close to maximum performance in text-only common knowledge tasks like history and geography, a gap remains for Ukrainian language and math, highlighting the importance of developing specialized language benchmarks for more accurate assessments of model capabilities and limitations across different languages and contexts.
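To make the per-subject comparison described in the abstract more concrete, the following is a minimal sketch of how answers to exam-style tasks might be scored. It is not the authors' evaluation code. The `Task` record, the `normalize` helper, the sample data, and the use of exact-match accuracy are all assumptions made for illustration; the abstract does not specify the data schema or the metric.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; the real dataset schema may differ."""
    subject: str  # e.g. "history" or "mathematics"
    kind: str     # "single-answer", "multiple-choice", "matching", "open-ended"
    answer: str   # reference answer as a string


def normalize(text: str) -> str:
    """Trim, lowercase, and treat a decimal comma like a decimal point."""
    return text.strip().lower().replace(",", ".")


def accuracy_by_subject(tasks, predictions):
    """Exact-match accuracy per subject (an assumed metric, not the paper's)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, prediction in zip(tasks, predictions):
        total[task.subject] += 1
        if normalize(prediction) == normalize(task.answer):
            correct[task.subject] += 1
    return {subject: correct[subject] / total[subject] for subject in total}


# Made-up sample data, purely illustrative.
example_tasks = [
    Task("history", "single-answer", "B"),
    Task("mathematics", "open-ended", "2.5"),
    Task("mathematics", "single-answer", "A"),
]
example_predictions = ["b ", "2,5", "C"]

scores = accuracy_by_subject(example_tasks, example_predictions)
print(scores)
```

Grouping by subject mirrors the abstract's contrast between history and geography on one side and Ukrainian language and math on the other. A real harness would likely need partial credit for matching tasks, which this sketch does not attempt.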

Comments:	7 pages, 5 figures. X International conference "Informatics. Culture. Technology." (2024)
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2501.06715 [cs.CL]
	(or arXiv:2501.06715v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.06715
Journal reference:	X International conference "Informatics. Culture. Technology." (2024) 185-191
Related DOI:	https://doi.org/10.15276/ict.01.2024.27

Submission history

From: Mykyta Syromiatnikov
[v1] Sun, 12 Jan 2025 04:49:06 UTC (1,115 KB)
