SEED: Domain-Specific Data Curation With Large Language Models

Chen, Zui; Cao, Lei; Madden, Sam; Kraska, Tim; Shang, Zeyuan; Fan, Ju; Tang, Nan; Gu, Zihui; Liu, Chunwei; Cafarella, Michael


arXiv:2310.00749 (cs)

[Submitted on 1 Oct 2023 (v1), last revised 24 Apr 2024 (this version, v3)]

Abstract: Data curation tasks that prepare data for analytics are critical for turning data into actionable insights. However, due to the diverse requirements of applications in different domains, generic off-the-shelf tools are typically insufficient. As a result, data scientists often have to develop domain-specific solutions tailored to both the dataset and the task, e.g., writing domain-specific code or training machine learning models on a sufficient number of annotated examples. This process is notoriously difficult and time-consuming. We present SEED, an LLM-as-compiler approach that automatically generates domain-specific data curation solutions via Large Language Models (LLMs). Once the user describes a task, input data, and expected output, the SEED compiler produces a hybrid pipeline that combines LLM querying with more cost-effective alternatives, such as vector-based caching, LLM-generated code, and small models trained on LLM-annotated data. SEED features an optimizer that automatically selects from the four LLM-assisted modules and forms a hybrid execution pipeline that best fits the task at hand. To validate this new, revolutionary approach, we conducted experiments on 9 datasets spanning 5 data curation tasks. In comparison to solutions that use the LLM on every data record, SEED achieves state-of-the-art or comparable few-shot performance, while significantly reducing the number of LLM calls.
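
The hybrid pipeline described in the abstract can be pictured as a cascade in which cheaper modules answer a record when they can and the LLM is queried only as a last resort. The following is a minimal sketch of that idea under stated assumptions, not SEED's actual implementation or API: the class `CascadePipeline`, the functions `rule_module` and `fake_llm`, and the toy state-abbreviation task are all hypothetical, and an exact-match dictionary stands in for the vector-based cache.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical module type: returns an answer, or None to defer to the next module.
Module = Callable[[str], Optional[str]]


class CascadePipeline:
    """Toy cascade: cache first, then cheap modules, then the LLM as a fallback."""

    def __init__(self, cheap_modules: List[Module], llm: Callable[[str], str]):
        self.cheap_modules = cheap_modules
        self.llm = llm
        self.llm_calls = 0
        # Exact-match cache; an assumption standing in for vector-based caching.
        self.cache: Dict[str, str] = {}

    def run(self, record: str) -> str:
        # 1. Reuse an earlier LLM answer for the same record if one exists.
        if record in self.cache:
            return self.cache[record]
        # 2. Try each cheap module in order; the first non-None answer wins.
        for module in self.cheap_modules:
            answer = module(record)
            if answer is not None:
                return answer
        # 3. Fall back to the LLM and remember its answer.
        self.llm_calls += 1
        answer = self.llm(record)
        self.cache[record] = answer
        return answer


def rule_module(record: str) -> Optional[str]:
    # Stands in for LLM-generated code: handles known cases, defers otherwise.
    known = {"calif.": "CA", "mass.": "MA"}
    return known.get(record.lower())


def fake_llm(record: str) -> str:
    # Placeholder for a real LLM call; always returns the same label.
    return "UNKNOWN"


pipeline = CascadePipeline([rule_module], fake_llm)
results = [pipeline.run(r) for r in ["Calif.", "Mass.", "Wash.", "Wash."]]
print(results, pipeline.llm_calls)
```

In this toy run, four records trigger a single LLM call: two are handled by the rule module, and the repeated record is served from the cache. The fixed module order is an assumption of the sketch; the abstract states only that an optimizer selects among the modules, not how it orders them.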

Comments:	preprint, 20 pages, 4 figures
Subjects:	Databases (cs.DB); Machine Learning (cs.LG)
Cite as:	arXiv:2310.00749 [cs.DB]
	(or arXiv:2310.00749v3 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.2310.00749

Submission history

From: Chen Zui
[v1] Sun, 1 Oct 2023 17:59:20 UTC (582 KB)
[v2] Sat, 2 Dec 2023 03:36:27 UTC (381 KB)
[v3] Wed, 24 Apr 2024 09:50:34 UTC (251 KB)
