Soft-Labeled Contrastive Pre-training for Function-level Code Representation

Li, Xiaonan; Guo, Daya; Gong, Yeyun; Lin, Yun; Shen, Yelong; Qiu, Xipeng; Jiang, Daxin; Chen, Weizhu; Duan, Nan

arXiv:2210.09597v1 (cs)

[Submitted on 18 Oct 2022 (this version), latest version 26 Oct 2022 (v2)]

Abstract: Code contrastive pre-training has recently achieved significant progress on code-related tasks. In this paper, we present SCodeR, a Soft-labeled contrastive pre-training framework with two positive sample construction methods to learn function-level Code Representation. Because code snippets in a large-scale corpus are often related to one another, soft-labeled contrastive pre-training obtains fine-grained soft labels in an iterative adversarial manner and uses them to learn better code representations. Positive sample construction is another key to contrastive pre-training. Previous works use transformation-based methods such as variable renaming to generate semantically equivalent positive code. However, the generated code usually has a highly similar surface form, which misleads the model into focusing on superficial code structure instead of code semantics. To encourage SCodeR to capture semantic information from the code, we use code comments and abstract syntax sub-trees of the code to build positive samples. We conduct experiments on four code-related tasks over seven datasets. Extensive experimental results show that SCodeR achieves new state-of-the-art performance on all of them, which illustrates the effectiveness of the proposed pre-training method.
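
To make the idea of soft labels in a contrastive objective concrete, the following is a minimal sketch and not the authors' implementation. It assumes the soft labels are already given as a row-stochastic matrix; the abstract says SCodeR obtains them in an iterative adversarial manner, and that procedure is not reproduced here. The function name soft_label_contrastive_loss, the temperature value, and the NumPy formulation are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def soft_label_contrastive_loss(queries, keys, soft_labels, temperature=0.05):
    """Cross-entropy between soft labels and in-batch similarity scores.

    queries, keys: (n, d) arrays of embeddings (hypothetical encoder outputs).
    soft_labels:   (n, n) array whose rows sum to 1. The identity matrix
                   recovers the usual hard-label in-batch contrastive loss.
    temperature:   assumed value, not taken from the paper.
    """
    # Cosine similarity via L2-normalized embeddings.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = (q @ k.T) / temperature
    log_probs = log_softmax(logits, axis=1)
    # Each row's loss is weighted by its soft label distribution.
    return float(-(soft_labels * log_probs).sum(axis=1).mean())


if __name__ == "__main__":
    emb = np.eye(4)              # four mutually orthogonal toy embeddings
    hard = np.eye(4)             # hard labels: only the paired key is positive
    soft = 0.9 * np.eye(4) + 0.1 / 4  # rows still sum to 1
    print(soft_label_contrastive_loss(emb, emb, hard))
    print(soft_label_contrastive_loss(emb, emb, soft))
```

With hard (one-hot) labels, every other sample in the batch is treated as an equally wrong negative. Soft labels let a related but unpaired sample receive a small positive weight, which is the motivation the abstract gives for using them.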

Comments:	Accepted to EMNLP 2022 (findings)
Subjects:	Computation and Language (cs.CL); Programming Languages (cs.PL); Software Engineering (cs.SE)
Cite as:	arXiv:2210.09597 [cs.CL]
	(or arXiv:2210.09597v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2210.09597

Submission history

From: Xiaonan Li
[v1] Tue, 18 Oct 2022 05:17:37 UTC (398 KB)
[v2] Wed, 26 Oct 2022 03:07:11 UTC (398 KB)