Synthetic Data Generation for Culturally Nuanced Commonsense Reasoning in Low-Resource Languages

Pranida, Salsabila Zahirah; Genadi, Rifo Ahmad; Koto, Fajri

arXiv:2502.12932 (cs)

[Submitted on 18 Feb 2025]

Abstract: Quantifying reasoning capability in low-resource languages remains a challenge in NLP due to data scarcity and limited access to annotators. While LLM-assisted dataset construction has proven useful for medium- and high-resource languages, its effectiveness in low-resource languages, particularly for commonsense reasoning, is still unclear. In this paper, we compare three dataset creation strategies: (1) LLM-assisted dataset generation, (2) machine translation, and (3) human-written data by native speakers, to build a culturally nuanced story comprehension dataset. We focus on Javanese and Sundanese, two major local languages in Indonesia, and evaluate the effectiveness of open-weight and closed-weight LLMs in assisting dataset creation through extensive manual validation. To assess the utility of synthetic data, we fine-tune language models on classification and generation tasks using this data and evaluate performance on a human-written test set. Our findings indicate that LLM-assisted data creation outperforms machine translation.

Comments:	18 pages total: 8 pages of main body, 6 pages of appendix. 4 figures in main body, 6 figures in appendix. Submitted to ARR in February 2025
Subjects:	Computation and Language (cs.CL)
MSC classes:	68T50
ACM classes:	I.2.7
Cite as:	arXiv:2502.12932 [cs.CL]
	(or arXiv:2502.12932v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.12932

Submission history

From: Salsabila Zahirah Pranida
[v1] Tue, 18 Feb 2025 15:14:58 UTC (10,800 KB)
