IndicBART: A Pre-trained Model for Indic Natural Language Generation

Dabre, Raj; Shrotriya, Himani; Kunchukuttan, Anoop; Puduppully, Ratish; Khapra, Mitesh M.; Kumar, Pratyush

doi:10.18653/v1/2022.findings-acl.145

arXiv:2109.02903 (cs)

[Submitted on 7 Sep 2021 (v1), last revised 27 Oct 2022 (this version, v2)]

Title: IndicBART: A Pre-trained Model for Indic Natural Language Generation

Authors: Raj Dabre, Himani Shrotriya, Anoop Kunchukuttan, Ratish Puduppully, Mitesh M. Khapra, Pratyush Kumar

Abstract: In this paper, we study pre-trained sequence-to-sequence models for a group of related languages, with a focus on Indic languages. We present IndicBART, a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English. IndicBART utilizes the orthographic similarity between Indic scripts to improve transfer learning between similar Indic languages. We evaluate IndicBART on two NLG tasks: Neural Machine Translation (NMT) and extreme summarization. Our experiments on NMT and extreme summarization show that a model specific to related languages like IndicBART is competitive with large pre-trained models like mBART50 despite being significantly smaller. It also performs well on very low-resource translation scenarios where languages are not included in pre-training or fine-tuning. Script sharing, multilingual training, and better utilization of limited model capacity contribute to the good performance of the compact IndicBART model.
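
The abstract does not say how script sharing is done. As a rough illustration only, one common way to exploit the orthographic similarity it mentions is to shift characters from one Indic Unicode block onto the Devanagari block, since the main Brahmi-derived blocks are laid out in parallel in Unicode. The function name `to_devanagari`, the script labels, and the choice of Devanagari as the shared script are assumptions made for this sketch, not details confirmed by the text above.

```python
# Hypothetical sketch of script unification by Unicode block offset.
# Assumption: text is mapped onto Devanagari; the abstract does not confirm this.

DEVANAGARI_START = 0x0900  # start of the Devanagari Unicode block
BLOCK_SIZE = 0x80          # each of these Indic blocks spans 128 code points

# Start of the Unicode block for each script (values from the Unicode charts).
SCRIPT_BLOCK_STARTS = {
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def to_devanagari(text: str, script: str) -> str:
    """Shift characters in the given script's block onto the Devanagari block.

    Characters outside the source block (digits, Latin letters, punctuation)
    pass through unchanged. The mapping is purely positional, so a shifted
    code point is not guaranteed to be an assigned Devanagari character.
    """
    start = SCRIPT_BLOCK_STARTS[script]
    converted = []
    for ch in text:
        code_point = ord(ch)
        if start <= code_point < start + BLOCK_SIZE:
            converted.append(chr(code_point - start + DEVANAGARI_START))
        else:
            converted.append(ch)
    return "".join(converted)


if __name__ == "__main__":
    # Bengali letter KA (U+0995) lands on Devanagari letter KA (U+0915).
    print(to_devanagari("\u0995", "bengali") == "\u0915")
```

With a mapping of this kind, cognate words written in different scripts can share subword vocabulary entries, which is one plausible reading of how script sharing helps transfer between related languages.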

Comments:	Published at ACL 2022, 15 pages
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2109.02903 [cs.CL]
	(or arXiv:2109.02903v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2109.02903
Related DOI:	https://doi.org/10.18653/v1/2022.findings-acl.145

Submission history

From: Prasanna Raj Noel Dabre
[v1] Tue, 7 Sep 2021 07:08:33 UTC (5,405 KB)
[v2] Thu, 27 Oct 2022 02:53:18 UTC (5,418 KB)
