On the Calibration of Multilingual Question Answering LLMs

Yang, Yahan; Dan, Soham; Roth, Dan; Lee, Insup

arXiv:2311.08669 (cs)

[Submitted on 15 Nov 2023 (v1), last revised 15 Apr 2024 (this version, v2)]

Abstract: Multilingual pre-trained Large Language Models (LLMs) are incredibly effective at Question Answering (QA), a core task in Natural Language Understanding, achieving high accuracies on several multilingual benchmarks. However, little is known about how well their confidences are calibrated. In this paper, we comprehensively benchmark the calibration of several multilingual LLMs (MLLMs) on a variety of QA tasks. We perform extensive experiments spanning encoder-only, encoder-decoder, and decoder-only QA models (ranging in size from 110M to 7B parameters) and diverse languages, including both high- and low-resource ones. We study different dimensions of calibration in in-distribution, out-of-distribution, and cross-lingual transfer settings, and investigate strategies to improve it, including post-hoc methods and regularized fine-tuning. For decoder-only LLMs such as Llama 2, we additionally find that in-context learning improves confidence calibration on multilingual data. We also conduct several ablation experiments to study the effect of language distance, language corpus size, and model size on calibration, and how multilingual models compare with their monolingual counterparts across diverse tasks and languages. Our experiments suggest that multilingual QA models are poorly calibrated for languages other than English, and that incorporating a small set of cheaply translated multilingual samples during fine-tuning/calibration effectively enhances calibration performance.
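
For readers unfamiliar with the term, a model is well calibrated when its stated confidence matches how often it is actually right. The abstract does not name the metric used, so the sketch below assumes a common choice, the expected calibration error (ECE) with equal-width confidence bins. The function name, the bin count, and the toy inputs are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,
) -> float:
    """Illustrative ECE: weighted mean of |accuracy - confidence| over bins.

    confidences: predicted probability of the chosen answer, each in [0, 1].
    correct:     1 if the chosen answer was right, 0 otherwise.
    n_bins:      number of equal-width bins (an assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * n_bins   # summed confidence per bin
    hit_sum = [0.0] * n_bins    # number of correct predictions per bin
    count = [0] * n_bins        # number of predictions per bin

    for c, y in zip(confidences, correct):
        # Bins are [0, 1/n_bins), ..., with confidence 1.0 placed in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += y
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum[b] / count[b]
        accuracy = hit_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Toy data (made up): the model claims 90% confidence but is right 3 times out of 4.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A lower value indicates better calibration. In a multilingual setting, one would compute this separately per language to see the kind of gap between English and other languages that the abstract describes.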

Comments:	Preprint. Under Submission
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2311.08669 [cs.CL]
	(or arXiv:2311.08669v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2311.08669

Submission history

From: Yahan Yang
[v1] Wed, 15 Nov 2023 03:29:02 UTC (345 KB)
[v2] Mon, 15 Apr 2024 14:44:04 UTC (1,632 KB)
