Measuring Sample Importance in Data Pruning for Training LLMs from a Data Compression Perspective

Kim, Minsang; Baek, Seungjun

Computer Science > Artificial Intelligence

arXiv:2406.14124v2 (cs)

[Submitted on 20 Jun 2024 (v1), last revised 21 Jun 2024 (this version, v2)]

Abstract: Compute-efficient training of large language models (LLMs) has become an important research problem. In this work, we consider data pruning as a method for data-efficient training of LLMs, taking a data compression view of data pruning. We argue that the amount of information in a sample, or the achievable compression of its description length, represents its importance. The key idea is that less informative samples are likely to contain redundant information and thus should be pruned first. We leverage the log-likelihood function of trained models as a surrogate to measure the information content of samples. Experiments reveal a surprising insight: information-based pruning can enhance the generalization capability of the model, improving language modeling and downstream task performance compared with a model trained on the entire dataset.
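
The idea in the abstract, scoring each sample by its negative log-likelihood under a trained model and pruning the least informative samples first, can be illustrated with a minimal sketch. This is not the authors' implementation: a smoothed unigram model stands in for the trained LLM, and the helper names (`fit_unigram`, `information_bits`, `prune`), the `keep_ratio` parameter, and the choice of total rather than per-token code length are all assumptions made for illustration.

```python
import math
from collections import Counter


def fit_unigram(corpus):
    """Fit an add-one-smoothed unigram model (a toy stand-in for a trained LLM)."""
    counts = Counter(tok for text in corpus for tok in text.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def prob(tok):
        return (counts.get(tok, 0) + 1) / (total + vocab)

    return prob


def information_bits(text, prob):
    """Negative log-likelihood of a sample in bits, i.e. its code length under the model."""
    return -sum(math.log2(prob(tok)) for tok in text.split())


def prune(corpus, keep_ratio, prob):
    """Keep the most informative fraction of samples, preserving their original order."""
    n_keep = max(1, round(len(corpus) * keep_ratio))
    ranked = sorted(
        range(len(corpus)),
        key=lambda i: information_bits(corpus[i], prob),
        reverse=True,
    )
    kept = sorted(ranked[:n_keep])
    return [corpus[i] for i in kept]


if __name__ == "__main__":
    corpus = [
        "the the the the",
        "the cat sat down",
        "quantum flux capacitor array",
    ]
    prob = fit_unigram(corpus)
    for text in corpus:
        print(f"{information_bits(text, prob):6.2f} bits  {text}")
    print(prune(corpus, keep_ratio=2 / 3, prob=prob))
```

In this toy corpus the repetitive first sample has the shortest code length, so it is the first to be pruned. Whether to rank by total or length-normalized code length is a design choice the abstract does not specify; the sketch uses the total.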

Subjects:	Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2406.14124 [cs.AI]
	(or arXiv:2406.14124v2 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2406.14124

Submission history

From: Minsang Kim
[v1] Thu, 20 Jun 2024 09:09:34 UTC (1,360 KB)
[v2] Fri, 21 Jun 2024 02:30:32 UTC (1,360 KB)
