Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance

Nakada, Ryumei; Xu, Yichen; Li, Lexin; Zhang, Linjun

Statistics > Machine Learning

arXiv:2406.03628 (stat)

[Submitted on 5 Jun 2024]



Abstract: Imbalanced data and spurious correlations are common challenges in machine learning and data science. Oversampling, which artificially increases the number of instances in the underrepresented classes, has been widely adopted to tackle these challenges. In this article, we introduce OPAL (OversamPling with Artificial LLM-generated data), a systematic oversampling approach that leverages the capabilities of large language models (LLMs) to generate high-quality synthetic data for minority groups. Recent studies on synthetic data generation using deep generative models mostly target prediction tasks. Our proposal differs in that we focus on handling imbalanced data and spurious correlations. More importantly, we develop a novel theory that rigorously characterizes the benefits of using the synthetic data, and shows the capacity of transformers in generating high-quality synthetic data for both labels and covariates. We further conduct intensive numerical experiments to demonstrate the efficacy of our proposed approach compared to some representative alternative solutions.
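To make the abstract's definition of oversampling concrete, here is a minimal sketch of the simplest baseline: random oversampling, which duplicates minority-class rows until the classes are balanced. This is not the OPAL method, which generates new synthetic rows with an LLM. The function name `random_oversample` and the toy data are assumptions made for illustration only.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Illustrative baseline only (not OPAL): resample minority-class rows
    with replacement until every class matches the largest class in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows that currently belong to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw as many duplicates as needed to reach the target size.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy imbalanced dataset: five rows of class 0, one row of class 1.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
labels = [0, 0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))  # Counter({0: 5, 1: 5})
```

Because this baseline only repeats existing rows, it adds no new information about the minority group, which is the gap that generating synthetic minority data is meant to address.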

Comments:	59 pages, 7 figures
Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG)
Cite as:	arXiv:2406.03628 [stat.ML]
	(or arXiv:2406.03628v1 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.2406.03628

Submission history

From: Ryumei Nakada
[v1] Wed, 5 Jun 2024 21:24:26 UTC (9,140 KB)
