Text Grafting: Near-Distribution Weak Supervision for Minority Classes in Text Classification

Peng, Letian; Gu, Yi; Dong, Chengyu; Wang, Zihan; Shang, Jingbo

arXiv:2406.11115 (cs)

[Submitted on 17 Jun 2024]

Abstract: For extremely weak-supervised text classification, pioneering research generates pseudo labels by mining texts similar to the class names from the raw corpus, which may yield very few or even no samples for the minority classes. Recent works have started to generate relevant texts by prompting LLMs with the class names or definitions; however, there is a high risk that LLMs cannot generate in-distribution data (i.e., data similar to the corpus where the text classifier will be applied), leading to classifiers that do not generalize. In this paper, we combine the advantages of these two approaches and propose to bridge the gap via a novel framework, text grafting, which aims to obtain clean and near-distribution weak supervision for minority classes. Specifically, we first use LLM-based logits to mine masked templates from the raw corpus that have a high potential for data synthesis into the target minority class. Then, the templates are filled by state-of-the-art LLMs to synthesize near-distribution texts falling into minority classes. Text grafting shows significant improvement over direct mining or synthesis on minority classes. We also use analysis and case studies to understand the properties of text grafting.
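
The two stages named in the abstract (mining masked templates, then filling them) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the LLM-based logits are turned into scores, so the per-token score, the fixed number of kept tokens, the "_" mask symbol, and the names mine_template, graft, toy_token_score, and toy_fill are all assumptions made for illustration. In practice the scorer and the filler would each be backed by an LLM; here they are toy stand-ins so the example runs on its own.

```python
from typing import Callable, List, Tuple

MASK = "_"  # hypothetical mask symbol, chosen for this sketch


def mine_template(
    tokens: List[str],
    token_score: Callable[[str], float],
    n_keep: int,
) -> Tuple[List[str], float]:
    """Keep the n_keep highest-scoring tokens and mask the rest.

    Returns the masked template and the mean score of the kept tokens,
    used here as the template's "potential" (an assumed definition).
    """
    scores = [token_score(t) for t in tokens]
    n = min(n_keep, len(tokens))
    # Indices of the n best tokens; ties are broken by position.
    order = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    keep = set(order[:n])
    template = [t if i in keep else MASK for i, t in enumerate(tokens)]
    potential = sum(scores[i] for i in keep) / n if n else 0.0
    return template, potential


def graft(
    corpus: List[str],
    token_score: Callable[[str], float],
    fill: Callable[[List[str]], str],
    n_keep: int = 2,
    top_k: int = 1,
) -> List[str]:
    """Mine a template from every non-empty text, keep the top_k by
    potential, and fill each selected template."""
    mined = [
        mine_template(text.split(), token_score, n_keep)
        for text in corpus
        if text.split()
    ]
    mined.sort(key=lambda pair: -pair[1])
    return [fill(template) for template, _ in mined[:top_k]]


# Toy stand-ins. A real scorer would come from LLM logits and a real
# filler would prompt an LLM to complete the masked positions.
TOY_SCORES = {"surprisingly": 0.9, "ending": 0.8, "movie": 0.2,
              "the": 0.1, "was": 0.1, "dull": -0.5}


def toy_token_score(token: str) -> float:
    return TOY_SCORES.get(token.lower(), 0.0)


def toy_fill(template: List[str]) -> str:
    return " ".join("<filled>" if t == MASK else t for t in template)


if __name__ == "__main__":
    corpus = ["the ending was surprisingly dull", "the movie was dull"]
    print(graft(corpus, toy_token_score, toy_fill))
```

Passing the scorer and the filler as callables keeps the mining and filling stages independent, so either stand-in can be swapped for a model-backed function without changing the pipeline.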

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2406.11115 [cs.CL]
	(or arXiv:2406.11115v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.11115

Submission history

From: Letian Peng
[v1] Mon, 17 Jun 2024 00:23:08 UTC (1,729 KB)
