How to Select Pre-Trained Code Models for Reuse? A Learning Perspective

Bi, Zhangqian; Wan, Yao; Chu, Zhaoyang; Hu, Yufei; Zhang, Junyi; Zhang, Hongyu; Xu, Guandong; Jin, Hai

Abstract:Pre-training a language model and then fine-tuning it has shown to be an efficient and effective technique for a wide range of code intelligence tasks, such as code generation, code summarization, and vulnerability detection. However, pretraining language models on a large-scale code corpus is computationally expensive. Fortunately, many off-the-shelf Pre-trained Code Models (PCMs), such as CodeBERT, CodeT5, CodeGen, and Code Llama, have been released publicly. These models acquire general code understanding and generation capability during pretraining, which enhances their performance on downstream code intelligence tasks. With an increasing number of these public pre-trained models, selecting the most suitable one to reuse for a specific task is essential. In this paper, we systematically investigate the reusability of PCMs. We first explore three intuitive model selection methods that select by size, training data, or brute-force fine-tuning. Experimental results show that these straightforward techniques either perform poorly or suffer high costs. Motivated by these findings, we explore learning-based model selection strategies that utilize pre-trained models without altering their parameters. Specifically, we train proxy models to gauge the performance of pre-trained models, and measure the distribution deviation between a model's latent features and the task's labels, using their closeness as an indicator of model transferability. We conduct experiments on 100 widely-used opensource PCMs for code intelligence tasks, with sizes ranging from 42.5 million to 3 billion parameters. The results demonstrate that learning-based selection methods reduce selection time to 100 seconds, compared to 2,700 hours with brute-force fine-tuning, with less than 6% performance degradation across related tasks.

Comments:	Accepted by IEEE SANER 2025
Subjects:	Software Engineering (cs.SE); Computation and Language (cs.CL)
Cite as:	arXiv:2501.03783 [cs.SE]
	(or arXiv:2501.03783v1 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2501.03783

Computer Science > Software Engineering

Title:How to Select Pre-Trained Code Models for Reuse? A Learning Perspective

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators