The Value of Out-of-Distribution Data

De Silva, Ashwin; Ramesh, Rahul; Priebe, Carey E.; Chaudhari, Pratik; Vogelstein, Joshua T.

arXiv:2208.10967v2 (cs)

[Submitted on 23 Aug 2022 (v1), revised 6 Oct 2022 (this version, v2), latest version 13 Jul 2023 (v5)]

Abstract: More data is expected to help us generalize to a task. But real datasets can contain out-of-distribution (OOD) data; this can come in the form of heterogeneity such as intra-class variability but also in the form of temporal shifts or concept drifts. We demonstrate a counter-intuitive phenomenon for such problems: generalization error of the task can be a non-monotonic function of the number of OOD samples; a small number of OOD samples can improve generalization but if the number of OOD samples is beyond a threshold, then the generalization error can deteriorate. We also show that if we know which samples are OOD, then using a weighted objective between the target and OOD samples ensures that the generalization error decreases monotonically. We demonstrate and analyze this phenomenon using linear classifiers on synthetic datasets and medium-sized neural networks on vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS, and DomainNet, and observe the effect that data augmentation, hyperparameter optimization, and pre-training have on this behavior.
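
The abstract mentions a weighted objective between the target and OOD samples but does not give its form. The sketch below illustrates one plausible reading, a convex combination of the mean target loss and the mean OOD loss, and contrasts it with naively pooling all samples. The function names, the mixing parameter `alpha`, and the convex-combination form are assumptions made for illustration, not the authors' stated method.

```python
from statistics import fmean


def pooled_risk(target_losses, ood_losses):
    """Naive objective: every sample counts equally, whatever its origin."""
    return fmean(list(target_losses) + list(ood_losses))


def weighted_risk(target_losses, ood_losses, alpha):
    """Hypothetical weighted objective (an assumption, not taken from the paper):
    alpha * mean(target losses) + (1 - alpha) * mean(OOD losses)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # An empty group contributes nothing to the objective.
    target_term = fmean(target_losses) if target_losses else 0.0
    ood_term = fmean(ood_losses) if ood_losses else 0.0
    return alpha * target_term + (1.0 - alpha) * ood_term


if __name__ == "__main__":
    target = [1.0, 3.0]      # per-sample losses on target data (made-up values)
    ood = [10.0] * 8         # per-sample losses on OOD data (made-up values)
    print(pooled_risk(target, ood))
    print(weighted_risk(target, ood, alpha=0.9))
```

In the pooled version the influence of the OOD group grows with its sample count, whereas in the weighted version it is fixed by `alpha` regardless of how many OOD samples are present. Using such an objective requires knowing which samples are OOD, as the abstract notes.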

Comments:	To be presented as a short paper at the Out-of-Distribution Generalization in Computer Vision (OOD-CV) workshop, ECCV 2022, Tel Aviv, Israel
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Cite as:	arXiv:2208.10967 [cs.LG]
	(or arXiv:2208.10967v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2208.10967

Submission history

From: Ashwin De Silva [view email]
[v1] Tue, 23 Aug 2022 13:41:01 UTC (2,907 KB)
[v2] Thu, 6 Oct 2022 10:12:00 UTC (4,119 KB)
[v3] Thu, 2 Feb 2023 03:31:21 UTC (4,163 KB)
[v4] Mon, 10 Jul 2023 09:15:22 UTC (685 KB)
[v5] Thu, 13 Jul 2023 10:02:22 UTC (685 KB)
