Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation

Zhu, Qin; Cheng, Qingyuan; Peng, Runyu; Li, Xiaonan; Liu, Tengxiao; Peng, Ru; Qiu, Xipeng; Huang, Xuanjing

arXiv:2406.13990 (cs)

[Submitted on 20 Jun 2024 (v1), last revised 23 Jun 2024 (this version, v2)]

Abstract: The training of large language models (LLMs) often involves varying degrees of test data contamination. Although current LLMs achieve increasingly strong results on various benchmarks, their performance in practical applications does not always match those results. Benchmark leakage can prevent accurate assessment of LLMs' true performance. However, constructing new benchmarks is costly, labor-intensive, and still carries the risk of leakage. In this paper we therefore ask: can we reuse these leaked benchmarks for LLM evaluation? We propose Inference-Time Decontamination (ITD), which addresses this issue by detecting leaked samples and rewriting them without altering their difficulty. ITD can mitigate the performance inflation caused by memorizing leaked benchmarks. Our proof-of-concept experiments demonstrate that ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU. On MMLU, applying ITD lowers the results of Phi3 and Mistral by 6.7% and 3.6%, respectively. We hope that ITD can provide more truthful evaluation results for large language models.
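
The detect-then-rewrite idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how leaked samples are detected or rewritten, so `is_leaked`, `rewrite`, `toy_model`, and the two-question benchmark below are all hypothetical stand-ins chosen only to show the control flow and why rewriting can deflate a score that comes from memorization.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    question: str
    answer: str


def decontaminate(
    samples: List[Sample],
    is_leaked: Callable[[Sample], bool],   # hypothetical leakage detector
    rewrite: Callable[[Sample], Sample],   # hypothetical difficulty-preserving rewriter
) -> List[Sample]:
    """Return the benchmark with flagged samples replaced by their rewrites."""
    cleaned = []
    for sample in samples:
        if is_leaked(sample):
            rewritten = rewrite(sample)
            # A rewrite may change the wording but must keep the gold answer.
            if rewritten.answer != sample.answer:
                raise ValueError("rewrite changed the gold answer")
            cleaned.append(rewritten)
        else:
            cleaned.append(sample)
    return cleaned


def accuracy(predict: Callable[[str], str], samples: List[Sample]) -> float:
    """Fraction of samples whose prediction equals the gold answer."""
    if not samples:
        return 0.0
    return sum(predict(s.question) == s.answer for s in samples) / len(samples)


# Toy setup (assumption): a "model" that only answers questions it memorized verbatim.
memorized = {"What is 2 + 3?": "5"}


def toy_model(question: str) -> str:
    return memorized.get(question, "unknown")


def toy_is_leaked(sample: Sample) -> bool:
    # Assumption: we happen to know which questions leaked.
    return sample.question in memorized


def toy_rewrite(sample: Sample) -> Sample:
    # Assumption: a trivial surface change stands in for a real paraphrase.
    return Sample("Rephrased: " + sample.question, sample.answer)


benchmark = [Sample("What is 2 + 3?", "5"), Sample("What is 4 + 4?", "8")]
cleaned = decontaminate(benchmark, toy_is_leaked, toy_rewrite)

raw_accuracy = accuracy(toy_model, benchmark)
cleaned_accuracy = accuracy(toy_model, cleaned)
print(raw_accuracy, cleaned_accuracy)
```

In this toy case the score on the leaked sample comes entirely from exact-match recall, so it disappears once the sample is reworded. A real detector and rewriter would need to be far more careful, in particular about keeping difficulty unchanged, which the sketch only approximates by checking that the gold answer is preserved.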

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2406.13990 [cs.CL]
	(or arXiv:2406.13990v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.13990

Submission history

From: Qin Zhu
[v1] Thu, 20 Jun 2024 04:35:59 UTC (1,664 KB)
[v2] Sun, 23 Jun 2024 16:46:00 UTC (1,664 KB)
