Unsupervised Bilingual Lexicon Induction for Low Resource Languages

Rathnayake, Charitha; Thilakarathna, P. R. S.; Nethmini, Uthpala; Kaur, Rishemjith; Ranathunga, Surangika


arXiv:2412.16894 (cs)

[Submitted on 22 Dec 2024]




Abstract: Bilingual lexicons play a crucial role in various Natural Language Processing tasks. However, many low-resource languages (LRLs) do not have such lexicons and, for the same reason, cannot benefit from supervised Bilingual Lexicon Induction (BLI) techniques. To address this, unsupervised BLI (UBLI) techniques were introduced. A prominent technique in this line is structure-based UBLI. It is an iterative method in which a seed lexicon, initially learned from monolingual embeddings, is progressively improved. There have been numerous improvements to this core idea; however, they have been evaluated independently of each other. In this paper, we investigate whether using these techniques simultaneously would lead to equal gains. We use the unsupervised version of VecMap, a commonly used structure-based UBLI framework, and carry out a comprehensive set of experiments on the LRL pairs English-Sinhala, English-Tamil, and English-Punjabi. These experiments helped us identify the best combination of the extensions. We also release bilingual dictionaries for English-Sinhala and English-Punjabi.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2412.16894 [cs.CL]
	(or arXiv:2412.16894v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.16894

Submission history

From: Charitha Rathnayake
[v1] Sun, 22 Dec 2024 07:07:09 UTC (847 KB)
