An Empirical Analysis of Uncertainty in Large Language Model Evaluations

Xie, Qiujie; Li, Qingqiu; Yu, Zhuohao; Zhang, Yuejie; Zhang, Yue; Yang, Linyi

arXiv:2502.10709 (cs)

[Submitted on 15 Feb 2025]

Abstract: As LLM-as-a-Judge emerges as a new paradigm for assessing large language models (LLMs), concerns have been raised regarding the alignment, bias, and stability of LLM evaluators. While substantial work has focused on alignment and bias, little research has concentrated on the stability of LLM evaluators. In this paper, we conduct extensive experiments involving 9 widely used LLM evaluators across 2 different evaluation settings to investigate the uncertainty in model-based LLM evaluations. We pinpoint that LLM evaluators exhibit varying uncertainty based on model families and sizes. With careful comparative analyses, we find that employing special prompting strategies, whether during inference or post-training, can alleviate evaluation uncertainty to some extent. By utilizing uncertainty to enhance LLM's reliability and detection capability in Out-Of-Distribution (OOD) data, we further fine-tune an uncertainty-aware LLM evaluator named ConfiLM using a human-annotated fine-tuning set and assess ConfiLM's OOD evaluation ability on a manually designed test set sourced from the 2024 Olympics. Experimental results demonstrate that incorporating uncertainty as additional information during the fine-tuning phase can largely improve the model's evaluation performance in OOD scenarios. The code and data are released at: this https URL.

Comments:	ICLR 2025
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2502.10709 [cs.CL]
	(or arXiv:2502.10709v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.10709

Submission history

From: Qiujie Xie
[v1] Sat, 15 Feb 2025 07:45:20 UTC (1,500 KB)
