From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars

Kornilov, Albert; Shavrina, Tatiana

arXiv:2411.15577 (cs)

[Submitted on 23 Nov 2024]

Abstract: Recent advances in language modeling have demonstrated significant improvements in zero-shot capabilities, including in-context learning, instruction following, and machine translation for extremely under-resourced languages (Tanzer et al., 2024). However, many languages with limited written resources rely primarily on formal descriptions of grammar and vocabulary.
In this paper, we introduce a set of benchmarks to evaluate how well models can extract and classify information from the complex descriptions found in linguistic grammars. We present a Retrieval-Augmented Generation (RAG)-based approach that leverages these descriptions for downstream tasks such as machine translation. Our benchmarks encompass linguistic descriptions for 248 languages across 142 language families, focusing on typological features from WALS and Grambank.
This set of benchmarks offers the first comprehensive evaluation of language models' in-context ability to accurately interpret and extract linguistic features, providing a critical resource for scaling NLP to low-resource languages. The code and data are publicly available at this https URL.

Comments:	submitted to COLING 2025
Subjects:	Computation and Language (cs.CL)
MSC classes:	68-06, 68T50, 68T01
ACM classes:	G.3; I.2.7
Cite as:	arXiv:2411.15577 [cs.CL]
	(or arXiv:2411.15577v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2411.15577

Submission history

From: Tatiana Shavrina
[v1] Sat, 23 Nov 2024 14:47:10 UTC (142 KB)
