Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks

Schaeffer, Rylan; Koura, Punit Singh; Tang, Binh; Subramanian, Ranjan; Singh, Aaditya K; Mihaylov, Todor; Bhargava, Prajjwal; Madaan, Lovish; Chatterji, Niladri S.; Goswami, Vedanuj; Edunov, Sergey; Hupkes, Dieuwke; Koyejo, Sanmi; Narang, Sharan


arXiv:2502.18339 (cs)

[Submitted on 24 Feb 2025]


Abstract: The explosion of high-performing conversational language models (LMs) has spurred a shift from classic natural language processing (NLP) benchmarks to expensive, time-consuming and noisy human evaluations - yet the relationship between these two evaluation strategies remains hazy. In this paper, we conduct a large-scale study of four Chat Llama 2 models, comparing their performance on 160 standard NLP benchmarks (e.g., MMLU, ARC, BIG-Bench Hard) against extensive human preferences on more than 11k single-turn and 2k multi-turn dialogues from over 2k human annotators. Our findings are striking: most NLP benchmarks strongly correlate with human evaluations, suggesting that cheaper, automated metrics can serve as surprisingly reliable predictors of human preferences. Three human evaluations, such as adversarial dishonesty and safety, are anticorrelated with NLP benchmarks, while two are uncorrelated. Moreover, through overparameterized linear regressions, we show that NLP scores can accurately predict human evaluations across different model scales, offering a path to reduce costly human annotation without sacrificing rigor. Overall, our results affirm the continued value of classic benchmarks and illuminate how to harness them to anticipate real-world user satisfaction - pointing to how NLP benchmarks can be leveraged to meet evaluation needs of our new era of conversational AI.
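As a rough illustration of what an overparameterized linear regression looks like in this setting (four models, 160 benchmark scores each), the sketch below fits minimum-norm least-squares weights and checks them by holding out one model at a time. It is not the authors' code: the data are synthetic, the function names are hypothetical, and both the minimum-norm solution via the pseudoinverse and the leave-one-model-out scheme are assumptions about how such an analysis could be set up.

```python
import numpy as np


def fit_min_norm(X, y):
    """Minimum-norm least-squares weights for y ~ X @ w.

    With more features (benchmarks) than rows (models), many exact fits
    exist; the pseudoinverse selects the one with the smallest L2 norm.
    """
    return np.linalg.pinv(X) @ y


def leave_one_out_predictions(X, y):
    """Hold out each row in turn, fit on the rest, predict the held-out row."""
    n = X.shape[0]
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        w = fit_min_norm(X[keep], y[keep])
        preds[i] = X[i] @ w
    return preds


# Synthetic stand-in data (assumption): 4 models, 160 benchmark scores each,
# with benchmark scores and the human-evaluation score both rising with a
# made-up "scale" factor.
rng = np.random.default_rng(0)
n_models, n_benchmarks = 4, 160
scale = np.linspace(0.2, 0.8, n_models)
X = scale[:, None] + 0.05 * rng.normal(size=(n_models, n_benchmarks))
y = 2.0 * scale + 0.01 * rng.normal(size=n_models)

# Pearson correlation between one benchmark and the human evaluation.
r = np.corrcoef(X[:, 0], y)[0, 1]

# Fit on all models, then predict each model from the other three.
w = fit_min_norm(X, y)
preds = leave_one_out_predictions(X, y)
print("correlation of benchmark 0 with human eval:", r)
print("held-out predictions:", preds)
print("actual human evals:  ", y)
```

Because there are far more benchmarks than models, the fit on all four models interpolates the training targets exactly, so only the held-out predictions say anything about how well benchmark scores predict human evaluations.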

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2502.18339 [cs.CL]
	(or arXiv:2502.18339v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.18339

Submission history

From: Rylan Schaeffer
[v1] Mon, 24 Feb 2025 01:01:02 UTC (3,650 KB)
