Performance Evaluation of Large Language Models in Statistical Programming

Song, Xinyi; Xie, Kexin; Lee, Lina; Chen, Ruizhe; Clark, Jared M.; He, Hao; He, Haoran; Min, Jie; Zhang, Xinlei; Zheng, Simin; Zhang, Zhiyang; Deng, Xinwei; Hong, Yili

arXiv:2502.13117 (stat)

[Submitted on 18 Feb 2025]

Abstract: The programming capabilities of large language models (LLMs) have revolutionized automatic code generation and opened new avenues for automatic statistical analysis. However, the validity and quality of the generated code need to be systematically evaluated before it can be widely adopted. Despite the growing prominence of LLMs, comprehensive evaluations of the statistical code they generate remain scarce in the literature. In this paper, we assess the performance of LLMs, including two versions of ChatGPT and one version of Llama, in the domain of SAS programming for statistical analysis. Our study uses a set of statistical analysis tasks covering diverse statistical topics and datasets. Each task includes a problem description, dataset information, and human-verified SAS code. We assess the quality of the SAS code generated by LLMs through human expert evaluation based on correctness, effectiveness, readability, executability, and the accuracy of output results. The analysis of rating scores reveals that while LLMs are useful for generating syntactically correct code, they struggle with tasks requiring deep domain understanding and may produce redundant or incorrect results. This study offers insights into the capabilities and limitations of LLMs in statistical programming, providing guidance for future advancements in AI-assisted coding systems for statistical analysis.

Comments:	27 pages, 8 figures
Subjects:	Applications (stat.AP); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2502.13117 [stat.AP]
	(or arXiv:2502.13117v1 [stat.AP] for this version)
	https://doi.org/10.48550/arXiv.2502.13117

Submission history

From: Yili Hong
[v1] Tue, 18 Feb 2025 18:37:15 UTC (159 KB)
