Benchmarks as Microscopes: A Call for Model Metrology

Saxon, Michael; Holtzman, Ari; West, Peter; Wang, William Yang; Saphra, Naomi

arXiv:2407.16711 (cs)

[Submitted on 22 Jul 2024 (v1), last revised 30 Jul 2024 (this version, v2)]

Abstract: Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, but developers nonetheless claim that their models have generalized traits such as reasoning or open-domain language understanding based on these flawed metrics. The science and practice of LMs requires a new approach to benchmarking which measures specific capabilities with dynamic assessments. To be confident in our metrics, we need a new discipline of model metrology -- one which focuses on how to generate benchmarks that predict performance under deployment. Motivated by our evaluation criteria, we outline how building a community of model metrology practitioners -- one focused on building tools and studying how to measure system capabilities -- is the best way to meet these needs and to add clarity to the AI discussion.

Comments:	Conference paper at COLM 2024
Subjects:	Software Engineering (cs.SE); Computation and Language (cs.CL)
Cite as:	arXiv:2407.16711 [cs.SE]
	(or arXiv:2407.16711v2 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2407.16711

Submission history

From: Michael Saxon
[v1] Mon, 22 Jul 2024 17:52:12 UTC (123 KB)
[v2] Tue, 30 Jul 2024 04:48:26 UTC (123 KB)
