Embedding-Driven Diversity Sampling to Improve Few-Shot Synthetic Data Generation

Lopez, Ivan; Haredasht, Fateme Nateghi; Caoili, Kaitlin; Chen, Jonathan H; Chaudhari, Akshay

arXiv:2501.11199 (cs)

[Submitted on 20 Jan 2025 (v1), last revised 25 Jan 2025 (this version, v2)]

Abstract: Accurate classification of clinical text often requires fine-tuning pre-trained language models, a process that is costly and time-consuming due to the need for high-quality data and expert annotators. Synthetic data generation offers an alternative, though pre-trained models may not capture the syntactic diversity of clinical notes. We propose an embedding-driven approach that uses diversity sampling from a small set of real clinical notes to guide large language models in few-shot prompting, generating synthetic text that better reflects clinical syntax. We evaluated this method using the CheXpert dataset on a classification task, comparing it to random few-shot and zero-shot approaches. Using cosine similarity and a Turing test, our approach produced synthetic notes that more closely align with real clinical text. Our pipeline reduced the data needed to reach the 0.85 AUC cutoff by 40% for AUROC and 30% for AUPRC, while augmenting models with synthetic data improved AUROC by 57% and AUPRC by 68%. Additionally, our synthetic data was 0.9 times as effective as real data, a 60% improvement in value.
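
The abstract does not say which diversity-sampling algorithm the authors use, so the following is only an illustrative sketch of the general idea: given embeddings of a small pool of real notes, pick a few that are far apart from one another to serve as few-shot examples. It assumes greedy farthest-point selection under cosine distance, which is one common choice and not necessarily the paper's method. The function names `cosine_distance` and `select_diverse` and the toy two-dimensional vectors are hypothetical; real embeddings would come from a text-embedding model.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_u * norm_v)


def select_diverse(embeddings, k, start=0):
    """Greedy farthest-point selection (an assumed stand-in for the paper's sampler).

    Starting from index `start`, repeatedly add the item whose distance to its
    nearest already-selected item is largest. Returns the selected indices in
    the order they were picked.
    """
    n = len(embeddings)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of embeddings")
    chosen = [start]
    # min_dist[i] is the distance from item i to its nearest selected item.
    min_dist = [cosine_distance(e, embeddings[start]) for e in embeddings]
    while len(chosen) < k:
        remaining = [i for i in range(n) if i not in chosen]
        nxt = max(remaining, key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i in remaining:
            d = cosine_distance(embeddings[i], embeddings[nxt])
            min_dist[i] = min(min_dist[i], d)
    return chosen


# Toy embeddings: items 0 and 1 are near-duplicates, 2 and 3 point elsewhere.
toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
few_shot_indices = select_diverse(toy_embeddings, k=3)
print(few_shot_indices)
```

On the toy data the near-duplicate (index 1) is skipped in favor of the two vectors that point in different directions. The notes at the selected indices would then be placed in the few-shot prompt, in contrast to the random few-shot baseline, which would draw indices uniformly.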

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2501.11199 [cs.CL]
	(or arXiv:2501.11199v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.11199

Submission history

From: Ivan Lopez
[v1] Mon, 20 Jan 2025 00:16:57 UTC (837 KB)
[v2] Sat, 25 Jan 2025 22:44:58 UTC (900 KB)
