ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems

Zhang, Yi; Deriu, Jan; Katsogiannis-Meimarakis, George; Kosten, Catherine; Koutrika, Georgia; Stockinger, Kurt

arXiv:2306.04743v1 (cs)

[Submitted on 7 Jun 2023 (this version), latest version 5 Dec 2023 (v2)]

Abstract: Natural Language to SQL (NL-to-SQL) systems have recently shown a significant increase in accuracy for natural language to SQL query translation. This improvement is due to the emergence of transformer-based language models and the popularity of the Spider benchmark, the de facto standard for evaluating NL-to-SQL systems. The top NL-to-SQL systems reach accuracies of up to 85%. However, Spider mainly contains simple databases with few tables, columns, and entries, which does not reflect a realistic setting. Moreover, complex real-world databases with domain-specific content have little to no training data available in the form of NL/SQL pairs, leading to poor performance of existing NL-to-SQL systems.
In this paper, we introduce ScienceBenchmark, a new complex NL-to-SQL benchmark for three real-world, highly domain-specific databases. For this new benchmark, SQL experts and domain experts created high-quality NL/SQL pairs for each domain. To obtain more data, we extended the small amount of human-generated data with synthetic data generated using GPT-3. We show that our benchmark is highly challenging, as the top-performing systems on Spider achieve very low performance on it. Thus, the challenge is manifold: creating NL-to-SQL systems for highly complex domains with only a small amount of hand-made training data augmented with synthetic data. To our knowledge, ScienceBenchmark is the first NL-to-SQL benchmark designed with complex real-world scientific databases, containing challenging training and test data carefully validated by domain experts.

Comments:	12 pages, 2 figures, 5 tables
Subjects:	Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
ACM classes:	H.2.4; I.2.7
Cite as:	arXiv:2306.04743 [cs.DB]
	(or arXiv:2306.04743v1 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.2306.04743

Submission history

From: Yi Zhang
[v1] Wed, 7 Jun 2023 19:37:55 UTC (428 KB)
[v2] Tue, 5 Dec 2023 15:05:58 UTC (447 KB)
