Building low-resource African language corpora: A case study of Kidawida, Kalenjin and Dholuo

Mbogho, Audrey; Awuor, Quin; Kipkebut, Andrew; Wanzare, Lilian; Oloo, Vivian

Abstract: Natural Language Processing is a crucial frontier in artificial intelligence, with broad applications in many areas, including public health, agriculture, education, and commerce. However, due to the lack of substantial linguistic resources, many African languages remain underrepresented in this digital transformation. This paper presents a case study on the development of linguistic corpora for three under-resourced Kenyan languages, Kidaw'ida, Kalenjin, and Dholuo, with the aim of advancing natural language processing and linguistic research in African communities. Our project, which lasted one year, employed a selective crowd-sourcing methodology to collect text and speech data from native speakers of these languages. Data collection involved (1) recording conversations and translating the resulting text into Kiswahili, thereby creating parallel corpora, and (2) reading and recording written texts to generate speech corpora. We made these resources freely accessible via open-research platforms, namely Zenodo for the parallel text corpora and Mozilla Common Voice for the speech datasets, thus facilitating ongoing contributions and access for developers to train models and develop Natural Language Processing applications. The project demonstrates how grassroots efforts in corpus building can support the inclusion of African languages in artificial intelligence innovations. In addition to filling resource gaps, these corpora are vital in promoting linguistic diversity and empowering local communities by enabling Natural Language Processing applications tailored to their needs. As African countries like Kenya increasingly embrace digital transformation, developing indigenous language resources becomes essential for inclusive growth. We encourage continued collaboration from native speakers and developers to expand and utilize these corpora.

Comments:	13 pages, 1 figure, intend to submit to a Springer Nature journal
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2501.11003 [cs.CL]
	(or arXiv:2501.11003v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.11003
