Revisiting Pre-trained Language Models and their Evaluation for Arabic Natural Language Understanding

Ghaddar, Abbas; Wu, Yimeng; Bagga, Sunyam; Rashid, Ahmad; Bibi, Khalil; Rezagholizadeh, Mehdi; Xing, Chao; Wang, Yasheng; Xinyu, Duan; Wang, Zhefeng; Huai, Baoxing; Jiang, Xin; Liu, Qun; Langlais, Philippe

arXiv:2205.10687 (cs)

[Submitted on 21 May 2022]

Abstract: In recent years, a growing body of work has developed pre-trained language models (PLMs) for the Arabic language. This work addresses two major problems in existing Arabic PLMs that constrain progress on Arabic NLU and NLG tasks. First, existing Arabic PLMs are not well explored, and their pre-training can be improved significantly using a more methodical approach. Second, there is a lack of systematic and reproducible evaluation of these models in the literature. In this work, we revisit both the pre-training and evaluation of Arabic PLMs. In terms of pre-training, we explore improving Arabic LMs from three perspectives: quality of the pre-training data, size of the model, and incorporation of character-level information. As a result, we release three new Arabic BERT-style models (JABER, Char-JABER, and SABER) and two T5-style models (AT5S and AT5B). In terms of evaluation, we conduct a comprehensive empirical study to systematically evaluate the performance of existing state-of-the-art models on ALUE, a leaderboard-powered benchmark for Arabic NLU tasks, and on a subset of the ARGEN benchmark for Arabic NLG tasks. We show that our models significantly outperform existing Arabic PLMs and achieve new state-of-the-art performance on discriminative and generative Arabic NLU and NLG tasks. Our models and the source code to reproduce our results will be made available shortly.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2205.10687 [cs.CL]
	(or arXiv:2205.10687v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2205.10687

Submission history

From: Abbas Ghaddar
[v1] Sat, 21 May 2022 22:38:19 UTC (129 KB)
