SelectLLM: Can LLMs Select Important Instructions to Annotate?

Parkar, Ritik Sachin; Kim, Jaehyung; Park, Jong Inn; Kang, Dongyeop

arXiv:2401.16553v4 (cs)

[Submitted on 29 Jan 2024 (v1), revised 5 Mar 2024 (this version, v4), latest version 27 Aug 2024 (v7)]

Abstract: Instruction tuning benefits from large and diverse datasets; however, creating such datasets involves a high cost of human labeling. While synthetic datasets generated by large language models (LLMs) have partly solved this issue, they often contain low-quality data. One effective solution is to selectively annotate unlabeled instructions, especially given the relative ease of acquiring unlabeled instructions or texts from various sources. However, how to select unlabeled instructions is not well explored, especially in the context of LLMs. Further, traditional data selection methods that rely on input embedding space density tend to underestimate instruction sample complexity, whereas those based on model prediction uncertainty often struggle with synthetic label quality. Therefore, we introduce SelectLLM, an alternative framework that leverages the capabilities of LLMs to select unlabeled instructions more effectively. SelectLLM consists of two key steps: coreset-based clustering of unlabeled instructions for diversity, followed by prompting an LLM to identify the most beneficial instructions within each cluster. Our experiments demonstrate that SelectLLM matches or outperforms other state-of-the-art methods on instruction tuning benchmarks. It exhibits remarkable consistency across human and synthetic datasets, along with better cross-dataset generalization, as evidenced by a 10% performance improvement on the Cleaned Alpaca test set when trained on Dolly data. All code and data are publicly available (this https URL).
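
The two steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the clustering details or the prompt, so the farthest-first (k-center greedy) clustering, the prompt wording, the reply parsing, and every function name below are assumptions made for illustration. The `llm` argument is a hypothetical callable that maps a prompt string to a reply string; a trivial stub stands in for a real model.

```python
import math
import re
from typing import Callable, List, Sequence

Vector = Sequence[float]


def k_center_greedy(vectors: List[Vector], k: int) -> List[int]:
    """Pick up to k spread-out center indices by farthest-first traversal.

    Assumption: used here as one simple coreset-style way to get diversity.
    """
    centers = [0]  # arbitrary but deterministic starting point
    min_dist = [math.dist(v, vectors[0]) for v in vectors]
    while len(centers) < min(k, len(vectors)):
        # The next center is the point farthest from all current centers.
        nxt = max(range(len(vectors)), key=min_dist.__getitem__)
        centers.append(nxt)
        min_dist = [min(d, math.dist(v, vectors[nxt]))
                    for d, v in zip(min_dist, vectors)]
    return centers


def assign_clusters(vectors: List[Vector], centers: List[int]) -> List[List[int]]:
    """Group every point with its nearest center."""
    clusters: List[List[int]] = [[] for _ in centers]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centers)),
                      key=lambda c: math.dist(v, vectors[centers[c]]))
        clusters[nearest].append(i)
    return clusters


def build_prompt(instructions: List[str], n_select: int) -> str:
    """Hypothetical selection prompt; the wording is an assumption."""
    numbered = "\n".join(f"{j + 1}. {t}" for j, t in enumerate(instructions))
    return (f"Below are {len(instructions)} unlabeled instructions.\n"
            f"{numbered}\n"
            f"Reply with the numbers of the {n_select} instructions that "
            f"would be most useful to annotate, separated by commas.")


def select_instructions(instructions: List[str], vectors: List[Vector],
                        k: int, n_select: int,
                        llm: Callable[[str], str]) -> List[int]:
    """Step 1: cluster for diversity. Step 2: ask the LLM within each cluster."""
    chosen: List[int] = []
    centers = k_center_greedy(vectors, k)
    for members in assign_clusters(vectors, centers):
        reply = llm(build_prompt([instructions[i] for i in members], n_select))
        # Parse 1-based item numbers from the reply, dropping duplicates.
        picks = list(dict.fromkeys(int(tok) - 1
                                   for tok in re.findall(r"\d+", reply)))
        valid = [p for p in picks if 0 <= p < len(members)]
        chosen.extend(members[p] for p in valid[:n_select])
    return chosen


# Toy data: the 2-D "embeddings" are made up for the example.
instructions = [
    "Translate 'hello' to French.",
    "Translate 'cat' to Spanish.",
    "Prove that sqrt(2) is irrational.",
    "Explain why the sky is blue.",
]
vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model: always answers with the first item."""
    return "1"


selected = select_instructions(instructions, vectors, k=2, n_select=1,
                               llm=stub_llm)
print([instructions[i] for i in selected])
```

With the stub, the sketch returns one instruction from each of the two clusters. In practice the embeddings would come from a sentence encoder and `llm` would wrap a real model call, both of which are outside what the abstract describes.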

Comments:	First Authors: Ritik Sachin Parkar and Jaehyung Kim | Second Author: Jong Inn Park | PI: Dongyeop Kang
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2401.16553 [cs.CL]
	(or arXiv:2401.16553v4 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2401.16553

Submission history

From: Ritik Sachin Parkar
[v1] Mon, 29 Jan 2024 20:44:10 UTC (10,251 KB)
[v2] Tue, 20 Feb 2024 07:58:23 UTC (10,172 KB)
[v3] Fri, 23 Feb 2024 22:28:17 UTC (10,172 KB)
[v4] Tue, 5 Mar 2024 20:55:35 UTC (10,172 KB)
[v5] Thu, 18 Apr 2024 01:35:12 UTC (10,172 KB)
[v6] Tue, 20 Aug 2024 20:51:22 UTC (10,178 KB)
[v7] Tue, 27 Aug 2024 17:57:07 UTC (10,178 KB)
