Dealing with Synthetic Data Contamination in Online Continual Learning

Wang, Maorong; Michel, Nicolas; Mao, Jiafeng; Yamasaki, Toshihiko

arXiv:2411.13852 (cs)

[Submitted on 21 Nov 2024]

Abstract: Image generation has shown remarkable results in producing high-fidelity, realistic images, in particular with the advancement of diffusion-based models. However, the prevalence of AI-generated images may have side effects for the machine learning community that are not yet clearly identified. Meanwhile, the success of deep learning in computer vision is driven by massive datasets collected from the Internet. The large quantity of synthetic data being added to the Internet may become an obstacle for future researchers trying to collect "clean" datasets free of AI-generated content. Prior research has shown that training on datasets contaminated by synthetic images may degrade performance. In this paper, we investigate the potential impact of contaminated datasets on Online Continual Learning (CL) research. We show experimentally that contaminated datasets might hinder the training of existing online CL methods. We also propose Entropy Selection with Real-synthetic similarity Maximization (ESRM), a method to alleviate the performance deterioration caused by synthetic images when training online CL models. Experiments show that our method can significantly alleviate this deterioration, especially when the contamination is severe. For reproducibility, the source code of our work is available at this https URL.

Comments:	Accepted to NeurIPS'24
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2411.13852 [cs.CV]
	(or arXiv:2411.13852v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2411.13852

Submission history

From: Maorong Wang
[v1] Thu, 21 Nov 2024 05:24:35 UTC (8,224 KB)
