DeepCRCEval: Revisiting the Evaluation of Code Review Comment Generation

Lu, Junyi; Li, Xiaojia; Hua, Zihan; Yu, Lei; Cheng, Shiqi; Yang, Li; Zhang, Fengjun; Zuo, Chun

arXiv:2412.18291 (cs)

[Submitted on 24 Dec 2024]

Abstract: Code review is a vital but demanding aspect of software development, generating significant interest in automating review comments. Traditional evaluation methods for these comments, primarily based on text similarity, face two major challenges: the inconsistent reliability of human-authored comments in open-source projects and the weak correlation of text similarity with objectives like enhancing code quality and detecting defects.
This study empirically analyzes benchmark comments using a novel set of criteria informed by prior research and developer interviews. We then similarly revisit the evaluation of existing methodologies. Our evaluation framework, DeepCRCEval, integrates human evaluators and Large Language Models (LLMs) for a comprehensive reassessment of current techniques based on the criteria set. We also introduce an innovative and efficient baseline, LLM-Reviewer, which leverages the few-shot learning capabilities of LLMs for a target-oriented comparison.
Our research highlights the limitations of text similarity metrics, finding that fewer than 10% of benchmark comments are of high quality for automation. In contrast, DeepCRCEval effectively distinguishes between high- and low-quality comments, proving to be a more reliable evaluation mechanism. Incorporating LLM evaluators into DeepCRCEval significantly boosts efficiency, reducing time and cost by 88.78% and 90.32%, respectively. Furthermore, LLM-Reviewer demonstrates the significant potential of focusing on the task's real targets in comment generation.

Comments:	Accepted to the 28th International Conference on Fundamental Approaches to Software Engineering (FASE 2025), part of the 28th European Joint Conferences on Theory and Practice of Software (ETAPS 2025)
Subjects:	Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2412.18291 [cs.SE]
	(or arXiv:2412.18291v1 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2412.18291

Submission history

From: Junyi Lu
[v1] Tue, 24 Dec 2024 08:53:54 UTC (2,101 KB)
