Idiom Detection in Sorani Kurdish Texts

Omer, Skala Kamaran; Hassani, Hossein

arXiv:2501.14528 (cs)

[Submitted on 24 Jan 2025 (v1), last revised 30 Jan 2025 (this version, v2)]

Abstract: Idiom detection using Natural Language Processing (NLP) is the computerized process of recognizing figurative expressions within a text that convey meanings beyond the literal interpretation of the words. While idiom detection has seen significant progress across various languages, the Kurdish language faces a considerable research gap in this area despite the importance of idioms in tasks like machine translation and sentiment analysis. This study addresses idiom detection in Sorani Kurdish by approaching it as a text classification task using deep learning techniques. To this end, we developed a dataset containing 10,580 sentences embedding 101 Sorani Kurdish idioms across diverse contexts. Using this dataset, we developed and evaluated three deep learning models: a KuBERT-based transformer sequence classifier, a Recurrent Convolutional Neural Network (RCNN), and a BiLSTM model with an attention mechanism. The evaluations showed that the transformer model, a fine-tuned BERT, consistently outperformed the others, achieving nearly 99% accuracy, while the RCNN achieved 96.5% and the BiLSTM 80%. These results highlight the effectiveness of Transformer-based architectures in low-resource languages like Kurdish. This research provides a dataset, three optimized models, and insights into idiom detection, laying a foundation for advancing Kurdish NLP.

Comments:	22 pages, 8 figures, 7 tables
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2501.14528 [cs.CL]
	(or arXiv:2501.14528v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.14528

Submission history

From: Hossein Hassani
[v1] Fri, 24 Jan 2025 14:31:30 UTC (2,342 KB)
[v2] Thu, 30 Jan 2025 10:15:35 UTC (2,322 KB)
