An information theoretic limit to data amplification

Watts, S. J.; Crow, L.

arXiv:2412.18041 (stat)

[Submitted on 23 Dec 2024]

Abstract: In recent years, generative artificial intelligence has been used to create data to support scientific analysis. For example, Generative Adversarial Networks (GANs) have been trained using Monte Carlo simulated input and then used to generate data for the same problem. The advantage is that a GAN creates data in significantly less computing time. N training events for a GAN can result in GN generated events, with the gain factor, G, greater than one. This appears to violate the principle that one cannot get information for free. GANs are not the only way to amplify data, so the general process is referred to here as data amplification and is studied using information-theoretic concepts. It is shown that a gain greater than one is possible whilst keeping the information content of the data unchanged. This leads to a mathematical bound that depends only on the numbers of generated and training events. The study determines the conditions on both the underlying and reconstructed probability distributions that ensure this bound. In particular, the resolution of variables in amplified data is not improved by the process, but the increase in sample size can still improve statistical significance. The bound is confirmed using computer simulation and analysis of GAN-generated data from the literature.
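The idea that amplification adds events without adding information can be illustrated with a toy experiment. The sketch below is not the authors' method and does not reproduce their bound: it assumes a Gaussian fitted to the N training events as a stand-in for a trained GAN, and the function name `amplification_demo` and all parameter values are hypothetical. It compares the error on the true mean obtained from the training events with the error obtained from GN generated events, alongside the naive expectation for GN genuinely independent events.

```python
import random
import statistics


def amplification_demo(n_train=100, gain=20, n_trials=200, seed=0):
    """Toy illustration: does generating G*N events sharpen the true mean?

    Assumption: the "generator" is a Gaussian fitted to the training
    events, used here only as a stand-in for a trained GAN.
    Returns (rmse_train, rmse_amplified, naive_error).
    """
    rng = random.Random(seed)
    true_mu, true_sigma = 0.0, 1.0
    sq_err_train = []
    sq_err_amplified = []
    for _ in range(n_trials):
        # N training events drawn from the underlying distribution.
        train = [rng.gauss(true_mu, true_sigma) for _ in range(n_train)]
        mu_hat = statistics.fmean(train)
        sigma_hat = statistics.stdev(train)
        # G*N generated events drawn from the fitted (reconstructed) model.
        generated = [rng.gauss(mu_hat, sigma_hat)
                     for _ in range(gain * n_train)]
        sq_err_train.append((mu_hat - true_mu) ** 2)
        sq_err_amplified.append((statistics.fmean(generated) - true_mu) ** 2)
    rmse_train = statistics.fmean(sq_err_train) ** 0.5
    rmse_amplified = statistics.fmean(sq_err_amplified) ** 0.5
    # Error expected if all G*N events were independent draws from the truth.
    naive_error = true_sigma / (gain * n_train) ** 0.5
    return rmse_train, rmse_amplified, naive_error


if __name__ == "__main__":
    rmse_train, rmse_amplified, naive_error = amplification_demo()
    print(f"RMSE from N training events:    {rmse_train:.4f}")
    print(f"RMSE from G*N generated events: {rmse_amplified:.4f}")
    print(f"Naive 1/sqrt(G*N) expectation:  {naive_error:.4f}")
```

In this toy setting the generated sample inherits the estimation error of the fitted model, so its error with respect to the true mean stays close to that of the training sample and well above the naive expectation for GN independent events.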

Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
Cite as:	arXiv:2412.18041 [stat.ML]
	(or arXiv:2412.18041v1 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.2412.18041

Submission history

From: Stephen Watts Prof.
[v1] Mon, 23 Dec 2024 23:27:51 UTC (2,007 KB)
