On the Economics of Multilingual Few-shot Learning: Modeling the Cost-Performance Trade-offs of Machine Translated and Manual Data

Ahuja, Kabir; Choudhury, Monojit; Dandapat, Sandipan

arXiv:2205.06350 (cs)

[Submitted on 12 May 2022 (v1), last revised 14 Nov 2022 (this version, v2)]

Abstract: Borrowing ideas from production functions in microeconomics, we introduce a framework to systematically evaluate the performance and cost trade-offs between machine-translated and manually created labelled data for task-specific fine-tuning of massively multilingual language models. We illustrate the effectiveness of our framework through a case study on the TyDIQA-GoldP dataset. One interesting conclusion of the study is that if the cost of machine translation is greater than zero, the optimal performance at least cost is always achieved with either a mix that includes some manually created data or manually created data alone. To our knowledge, this is the first attempt to extend the concept of production functions to the study of data collection strategies for training multilingual models, and the framework can serve as a valuable tool for analyzing other similar cost-versus-data trade-offs in NLP.

Comments:	NAACL 2022
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2205.06350 [cs.CL]
	(or arXiv:2205.06350v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2205.06350

Submission history

From: Kabir Ahuja
[v1] Thu, 12 May 2022 20:27:01 UTC (11,103 KB)
[v2] Mon, 14 Nov 2022 15:48:47 UTC (14,022 KB)
