Towards Reliable Testing for Multiple Information Retrieval System Comparisons

Otero, David; Parapar, Javier; Barreiro, Álvaro

arXiv:2501.03930 (cs)

[Submitted on 7 Jan 2025]

Abstract: Null Hypothesis Significance Testing is the de facto tool for assessing effectiveness differences between Information Retrieval (IR) systems. Researchers use statistical tests to check whether those differences will generalise to online settings or are merely due to the samples observed in the laboratory. Much work has been devoted to studying which test is the most reliable when comparing a pair of systems, but most real-world IR experiments involve more than two. In this multiple comparisons scenario, testing several systems simultaneously may inflate the errors committed by the tests. In this paper, we use a new approach to assess the reliability of multiple comparison procedures using simulated and real TREC data. Experiments show that the Wilcoxon test with the Benjamini-Hochberg correction yields Type I error rates in line with the significance level for typical sample sizes while being the best test in terms of statistical power.
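To make the correction named in the abstract concrete, here is a minimal sketch of the standard Benjamini-Hochberg step-up procedure in plain Python. It is an illustration only, not the authors' experimental code. The function name `benjamini_hochberg` and the example p-values are hypothetical. In practice the p-values would come from pairwise Wilcoxon signed-rank tests over per-topic scores, a step this sketch does not perform.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return one reject/keep decision per p-value, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    # Step-up rule: reject every hypothesis ranked at or below largest_k.
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject


# Hypothetical p-values for the six pairwise comparisons among four systems.
example_p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.60]
print(benjamini_hochberg(example_p_values, alpha=0.05))
# prints [True, True, False, False, False, False]
```

In this example, three p-values fall below 0.05 but are still not rejected, because each exceeds its rank-scaled threshold. That is how the correction limits the error inflation caused by testing several system pairs at once.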

Subjects:	Information Retrieval (cs.IR)
Cite as:	arXiv:2501.03930 [cs.IR]
	(or arXiv:2501.03930v1 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.2501.03930

Submission history

From: David Otero
[v1] Tue, 7 Jan 2025 16:48:21 UTC (331 KB)
