NativQA: Multilingual Culturally-Aligned Natural Query for LLMs

Hasan, Md. Arid; Hasanain, Maram; Ahmad, Fatema; Laskar, Sahinur Rahman; Upadhyay, Sunaya; Sukhadia, Vrunda N; Kutlu, Mucahid; Chowdhury, Shammur Absar; Alam, Firoj

arXiv:2407.09823 (cs)

[Submitted on 13 Jul 2024 (v1), last revised 6 Oct 2024 (this version, v2)]

Abstract: Natural Question Answering (QA) datasets play a crucial role in evaluating the capabilities of large language models (LLMs), ensuring their effectiveness in real-world applications. Despite the numerous QA datasets that have been developed, there is a notable lack of region-specific datasets generated by native users in their own languages. This gap hinders the effective benchmarking of LLMs for regional and cultural specificities. Furthermore, it also limits the development of fine-tuned models. In this study, we propose a scalable, language-independent framework, NativQA, to seamlessly construct culturally and regionally aligned QA datasets in native languages, for LLM evaluation and tuning. We demonstrate the efficacy of the proposed framework by designing a multilingual natural QA dataset, MultiNativQA, consisting of ~64k manually annotated QA pairs in seven languages, ranging from high to extremely low resource, based on queries from native speakers from 9 regions covering 18 topics. We benchmark open- and closed-source LLMs with the MultiNativQA dataset. We also showcase the framework's efficacy in constructing fine-tuning data, especially for low-resource and dialectally-rich languages. We have made both the NativQA framework and the MultiNativQA dataset publicly available for the community (this https URL).

Comments:	LLMs, Native, Multilingual, Language Diversity, Contextual Understanding, Minority Languages, Culturally Informed, Foundation Models, Large Language Models
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
MSC classes:	68T50
ACM classes:	F.2.2; I.2.7
Cite as:	arXiv:2407.09823 [cs.CL]
	(or arXiv:2407.09823v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2407.09823

Submission history

From: Firoj Alam
[v1] Sat, 13 Jul 2024 09:34:00 UTC (4,332 KB)
[v2] Sun, 6 Oct 2024 10:46:41 UTC (6,266 KB)
