Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora

Karch, Tristan; Engel, Luca; Schwaller, Philippe; Kaplan, Frédéric

Computer Science > Computation and Language

arXiv:2502.13691 (cs)

[Submitted on 19 Feb 2025]

Abstract: As large language models (LLMs) converge towards similar capabilities, the key to advancing their performance lies in identifying and incorporating valuable new information sources. However, evaluating which text collections are worth the substantial investment required for digitization, preprocessing, and integration into LLM systems remains a significant challenge. We present a novel approach to this challenge: an automated pipeline that evaluates the potential information gain from text collections without requiring model training or fine-tuning. Our method generates multiple-choice questions (MCQs) from texts and measures an LLM's performance both with and without access to the source material. The performance gap between these conditions serves as a proxy for the collection's information potential. We validate our approach using three strategically selected datasets: EPFL PhD manuscripts (likely containing novel specialized knowledge), Wikipedia articles (presumably part of training data), and a synthetic baseline dataset. Our results demonstrate that this method effectively identifies collections containing valuable novel information, providing a practical tool for prioritizing data acquisition and integration efforts.
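The scoring step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `MCQ` container, the `Answerer` callable signature, and the `toy_answerer` stand-in for an LLM are all hypothetical names introduced here, and question generation is assumed to have happened already. The sketch only shows how a gap between accuracy with source access and accuracy without it could be computed.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class MCQ:
    """One multiple-choice question generated from a source passage (hypothetical container)."""
    question: str
    options: Sequence[str]
    answer_index: int
    source_text: str


# Hypothetical model interface: (question, options, context or None) -> chosen option index.
Answerer = Callable[[str, Sequence[str], Optional[str]], int]


def accuracy(mcqs: Sequence[MCQ], answerer: Answerer, with_context: bool) -> float:
    """Fraction of questions answered correctly, with or without the source passage."""
    if not mcqs:
        raise ValueError("need at least one question")
    correct = sum(
        answerer(q.question, q.options, q.source_text if with_context else None)
        == q.answer_index
        for q in mcqs
    )
    return correct / len(mcqs)


def information_potential(mcqs: Sequence[MCQ], answerer: Answerer) -> float:
    """Accuracy with source access minus accuracy without it."""
    return accuracy(mcqs, answerer, True) - accuracy(mcqs, answerer, False)


def toy_answerer(question: str, options: Sequence[str], context: Optional[str]) -> int:
    """Stand-in for an LLM: picks an option found in the context, else guesses option 0."""
    if context is not None:
        for i, option in enumerate(options):
            if option in context:
                return i
    return 0


demo = [
    MCQ("Which alloy was studied?", ["Alloy A", "Alloy B"], 0,
        "The thesis studies Alloy A under load."),
    MCQ("Which solvent was used?", ["Solvent X", "Solvent Y"], 1,
        "All runs used Solvent Y."),
]

gap = information_potential(demo, toy_answerer)
print(f"information potential (accuracy gap): {gap:.2f}")
```

Under this reading, a gap near zero would suggest the model already knows the material (as the abstract presumes for Wikipedia), while a large positive gap would suggest the collection carries information the model lacks. In practice `toy_answerer` would be replaced by a call to an actual LLM.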

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2502.13691 [cs.CL]
	(or arXiv:2502.13691v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.13691

Submission history

From: Tristan Karch
[v1] Wed, 19 Feb 2025 13:03:06 UTC (1,574 KB)
