CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification

Xu, Jiacheng; Pang, Bo; Qu, Jin; Hayashi, Hiroaki; Xiong, Caiming; Zhou, Yingbo

arXiv:2502.08806 (cs)

[Submitted on 12 Feb 2025]

Abstract: Software testing is a critical aspect of software development, yet generating test cases remains a routine task for engineers. This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases under specific conditions. The tasks range from simple assertion completions to writing test cases that cover specific code blocks across multiple files. They are based on 12 Python repositories and comprise 845 problems with context lengths ranging from 4k to 128k tokens. Utilizing code testing frameworks, we propose a method to construct retrieval contexts using coverage information. While models exhibit comparable performance with short contexts, notable differences emerge with 16k contexts. Models like GPT-4o and Claude 3.5 can effectively leverage relevant snippets; however, all models score below 35% on the complex Task III, even with the oracle context provided, underscoring the benchmark's significance and the potential for model improvement. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
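The abstract mentions constructing retrieval contexts from coverage information but does not give the procedure. As a rough illustration of the general idea only, and not the authors' method, the sketch below records which functions a test enters and keeps only those snippets as context. Every name here (`covered_functions`, `build_context`, the toy functions, and the character budget standing in for a token budget) is a hypothetical chosen for the example.

```python
import sys


def covered_functions(test_fn):
    """Run test_fn and return the names of Python functions entered while it ran."""
    seen = set()

    def tracer(frame, event, arg):
        # The global trace function receives a "call" event for each new frame.
        if event == "call":
            seen.add(frame.f_code.co_name)
        return None  # per-line tracing is not needed for this sketch

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        test_fn()
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active
    seen.discard(test_fn.__name__)  # the test itself is not context
    return seen


def build_context(snippets, covered, budget_chars):
    """Join covered snippets, in their given order, until the budget is spent."""
    picked, used = [], 0
    for name, source in snippets.items():
        if name not in covered:
            continue
        if used + len(source) > budget_chars:
            break
        picked.append(source)
        used += len(source)
    return "\n\n".join(picked)


# Toy "repository": two functions the test exercises and one it does not.
def parse(text):
    return [int(tok) for tok in text.split(",")]


def total(text):
    return sum(parse(text))


def unused(text):
    return text.upper()


def test_total():
    assert total("1,2,3") == 6


# Hypothetical snippet store; a real pipeline would read these from source files.
snippets = {
    "parse": "def parse(text): ...",
    "total": "def total(text): ...",
    "unused": "def unused(text): ...",
}

covered = covered_functions(test_total)
context = build_context(snippets, covered, budget_chars=200)
```

In this toy run, `context` holds the `parse` and `total` snippets and omits `unused`, since the test never calls it. A real setup would more likely rely on a dedicated coverage tool than on a hand-written tracer.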

Comments:	16 pages
Subjects:	Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2502.08806 [cs.SE]
	(or arXiv:2502.08806v1 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2502.08806

Submission history

From: Jiacheng Xu
[v1] Wed, 12 Feb 2025 21:42:56 UTC (323 KB)
