Distill Gold from Massive Ores: Efficient Dataset Distillation via Critical Samples Selection

Xu, Yue; Li, Yong-Lu; Cui, Kaitong; Wang, Ziyu; Lu, Cewu; Tai, Yu-Wing; Tang, Chi-Keung

arXiv:2305.18381v1 (cs)

[Submitted on 28 May 2023 (this version), latest version 7 Aug 2024 (v4)]

Abstract: Data-efficient learning has drawn significant attention, especially given the current trend of large multi-modal models, where dataset distillation can be an effective solution. However, the dataset distillation process itself is still very inefficient. In this work, we model the distillation problem with reference to information theory. Observing that severe data redundancy exists in dataset distillation, we argue for placing more emphasis on the utility of the training samples. We propose a family of methods to exploit the most valuable samples, which is validated by our comprehensive analysis of the optimal data selection. The new strategy significantly reduces the training cost and extends a variety of existing distillation algorithms to much larger-scale and more heterogeneous datasets, e.g. ImageNet-1K and Kinetics-400; in some cases, only 0.04% of the training data is sufficient for comparable distillation performance. Moreover, our strategy consistently enhances the performance, which may open up new analyses on the dynamics of distillation and networks. Our code will be made publicly available.

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2305.18381 [cs.LG]
	(or arXiv:2305.18381v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2305.18381

Submission history

From: Yue Xu
[v1] Sun, 28 May 2023 06:53:41 UTC (1,276 KB)
[v2] Fri, 3 Nov 2023 14:24:45 UTC (1,277 KB)
[v3] Wed, 29 Nov 2023 10:46:19 UTC (1,255 KB)
[v4] Wed, 7 Aug 2024 12:59:31 UTC (1,392 KB)
