Exploring Learning Complexity for Efficient Downstream Dataset Pruning

Jiang, Wenyu; Liu, Zhenlong; Xie, Zejian; Zhang, Songxin; Jing, Bingyi; Wei, Hongxin

arXiv:2402.05356 (cs)

[Submitted on 8 Feb 2024 (v1), last revised 8 Oct 2024 (this version, v2)]

Abstract: The ever-increasing fine-tuning cost of large-scale pre-trained models makes dataset pruning increasingly important; its aim is to reduce dataset size while maintaining task performance. However, existing dataset pruning methods require training on the entire dataset, which is impractical for large-scale pre-trained models. In this paper, we propose a straightforward, novel, and training-free hardness score named Distorting-based Learning Complexity (DLC) to efficiently identify informative images and instructions in the downstream dataset. Our method is motivated by the observation that easy samples, which are learned faster, can also be learned with fewer parameters. Specifically, we define the Learning Complexity to quantify sample hardness and use a lightweight weight-masking process for fast estimation instead of costly SGD optimization. Based on DLC, we further design a flexible under-sampling strategy with randomness (dubbed FlexRand), which replaces the top-K strategy to alleviate severe subset distribution shift. Extensive experiments on downstream image and instruction dataset pruning benchmarks demonstrate the effectiveness and efficiency of the proposed approach. On the image pruning benchmark, DLC reduces pruning time by 35x while establishing state-of-the-art performance with FlexRand.
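The two ideas in the abstract, a hardness score estimated by masking weights and a randomized alternative to top-K selection, can be illustrated with a toy sketch. This is not the paper's DLC or FlexRand: the abstract does not give their definitions, so everything here is an assumption made for illustration. The model is a hypothetical linear classifier, the score is simply the average cross-entropy loss over randomly masked copies of its weights, and the sampler draws K indices at random from a window of the highest-scoring samples. The names `masking_hardness`, `window_sample`, `keep_ratios`, and `window` are all hypothetical.

```python
import math
import random


def softmax_loss(logits, label):
    """Cross-entropy loss of one sample, computed from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def masked_logits(weights, x, keep):
    """Logits of a linear classifier using only the weights where keep == 1."""
    return [
        sum(w * xi * k for w, xi, k in zip(row, x, keep_row))
        for row, keep_row in zip(weights, keep)
    ]


def masking_hardness(weights, samples, labels, keep_ratios, rng):
    """Assumed score: mean loss of each sample over randomly masked weights.

    No gradient step is taken; one random mask is drawn per keep ratio.
    """
    scores = [0.0] * len(samples)
    for ratio in keep_ratios:
        keep = [[1 if rng.random() < ratio else 0 for _ in row] for row in weights]
        for i, (x, y) in enumerate(zip(samples, labels)):
            loss = softmax_loss(masked_logits(weights, x, keep), y)
            scores[i] += loss / len(keep_ratios)
    return scores


def window_sample(scores, k, window, rng):
    """Assumed sampler: draw k indices at random from the `window` top scores.

    With window == k this reduces to plain top-K selection.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(rng.sample(ranked[:window], k))


# Toy data: two classes, three samples, a fixed 2x2 weight matrix.
rng = random.Random(0)
weights = [[2.0, 0.0], [0.0, 2.0]]
samples = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.4]]
labels = [0, 1, 0]

scores = masking_hardness(weights, samples, labels, [0.9, 0.7, 0.5], rng)
subset = window_sample(scores, k=2, window=3, rng=rng)
```

Widening `window` beyond `k` trades some score fidelity for a subset that is less concentrated at one end of the score distribution, which is the general motivation the abstract gives for replacing top-K.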

Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2402.05356 [cs.LG] (or arXiv:2402.05356v2 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2402.05356

Submission history

From: Wenyu Jiang
[v1] Thu, 8 Feb 2024 02:29:33 UTC (1,217 KB)
[v2] Tue, 8 Oct 2024 13:56:33 UTC (5,556 KB)
