Small Language Model as Data Prospector for Large Language Model

Ni, Shiwen; Wu, Haihong; Yang, Di; Qu, Qiang; Alinejad-Rokny, Hamid; Yang, Min

arXiv:2412.09990 (cs)

[Submitted on 13 Dec 2024]

Abstract: The quality of instruction data directly affects the performance of fine-tuned Large Language Models (LLMs). Previously, Li et al. (2023) proposed NUGGETS, which selects high-quality data from a large dataset by identifying the individual instruction examples that, when learnt as one-shot instances, significantly improve performance across different tasks. In this work, we propose SuperNUGGETS, an improved variant of NUGGETS optimised for efficiency and performance. SuperNUGGETS uses a small language model (SLM) instead of a large language model (LLM) to filter the data for outstanding one-shot instances, and it refines the predefined test set. Experimental results show that the performance of SuperNUGGETS decreases by only 1-2% compared to NUGGETS, while efficiency increases by a factor of 58. Compared to the original NUGGETS, SuperNUGGETS has higher practical utility because of its significantly lower resource consumption.
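The selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring rule (fraction of test tasks whose loss drops when the candidate is prepended as a one-shot demonstration), the function names `one_shot_score` and `select_top`, and the `toy_loss` stand-in for a small language model's loss are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]             # (instruction, response)
LossFn = Callable[[str, str], float]  # hypothetical: loss of `answer` given `prompt`


def one_shot_score(candidate: Example,
                   test_set: Sequence[Example],
                   loss_fn: LossFn) -> float:
    """Fraction of test tasks whose loss drops when `candidate` is
    prepended to the prompt as a one-shot demonstration (assumed rule)."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    demo = f"{candidate[0]}\n{candidate[1]}\n\n"
    wins = 0
    for instruction, answer in test_set:
        zero_shot = loss_fn(instruction, answer)
        one_shot = loss_fn(demo + instruction, answer)
        if one_shot < zero_shot:
            wins += 1
    return wins / len(test_set)


def select_top(pool: Sequence[Example],
               test_set: Sequence[Example],
               loss_fn: LossFn,
               k: int) -> List[Example]:
    """Keep the k candidates with the highest one-shot score."""
    ranked = sorted(pool,
                    key=lambda ex: one_shot_score(ex, test_set, loss_fn),
                    reverse=True)
    return ranked[:k]


def toy_loss(prompt: str, answer: str) -> float:
    """Toy stand-in for a small model's loss: 0.0 if the answer already
    appears as a word in the prompt, else 1.0. Purely illustrative."""
    return 0.0 if answer in prompt.split() else 1.0


if __name__ == "__main__":
    pool = [("Translate 'chat' to English", "cat"), ("Say hi", "hello")]
    tests = [("Name a pet", "cat"), ("Name a feline", "cat")]
    print(select_top(pool, tests, toy_loss, k=1))
```

In a real setting, `loss_fn` would wrap a language model's per-token loss on the answer. The abstract's claim is that computing this with a small model instead of a large one keeps the selected data nearly as useful at far lower cost.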

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2412.09990 [cs.CL]
	(or arXiv:2412.09990v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.09990

Submission history

From: Shiwen Ni
[v1] Fri, 13 Dec 2024 09:23:58 UTC (1,063 KB)
