Developers often perform repetitive code editing activities (up to 70%) for various reasons (e.g., code refactoring) during software development. Many deep learning (DL) models have been proposed to automate code editing by learning from code editing history. Among DL-based models, pre-trained code editing models have achieved state-of-the-art (SOTA) results. Pre-trained models are first pre-trained with pre-training tasks and then fine-tuned on the code editing task. Existing pre-training tasks are mainly code infilling tasks (e.g., masked language modeling), which are derived from the natural language processing field and are not designed for automatic code editing.
In this paper, we propose a novel pre-training task specialized for code editing and present an effective pre-trained code editing model named CodeEditor. Compared to previous code infilling tasks, our pre-training task further improves the performance and generalization ability of code editing models. Specifically, we collect many real-world code snippets as the ground truth and use a powerful generator to rewrite them into mutated versions. Then, we pre-train CodeEditor to edit the mutated versions back into the corresponding ground truth, thereby learning edit patterns.
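As a rough illustration (not the actual implementation), the pre-training pairs can be sketched as follows; toy_mutator is a hypothetical stand-in for the learned generator described above:

import random

def toy_mutator(code: str, rng: random.Random) -> str:
    """Stand-in for the generator: rewrite a snippet into a plausible
    but different ("mutated") version by perturbing one token."""
    tokens = code.split()
    if not tokens:
        return code
    i = rng.randrange(len(tokens))
    tokens[i] = "<mut>"  # toy perturbation; a real generator predicts code
    return " ".join(tokens)

def build_pretraining_pairs(snippets, seed=0):
    """Each pair is (input=mutated code, target=ground-truth code);
    pre-training on such pairs teaches the model to edit mutants back."""
    rng = random.Random(seed)
    return [(toy_mutator(s, rng), s) for s in snippets]

if __name__ == "__main__":
    ground_truth = ["int add(int a, int b) { return a + b; }"]
    for mutated, original in build_pretraining_pairs(ground_truth):
        print("input :", mutated)
        print("target:", original)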
We conduct experiments on four code editing datasets and evaluate the pre-trained CodeEditor in three settings (i.e., fine-tuning, few-shot, and zero-shot). (1) In the fine-tuning setting, we train the pre-trained CodeEditor on the four datasets and evaluate it on the test data.
CodeEditor outperforms the SOTA baselines by 15%, 25.5%, 9.4%, and 26.6% on the four datasets. (2) In the few-shot setting, we train the pre-trained CodeEditor with limited data and evaluate it on the test data. CodeEditor performs substantially better than all baselines, even outperforming baselines that are fine-tuned with all the data. (3) In the zero-shot setting, we evaluate the pre-trained CodeEditor on the test data without any training.
CodeEditor correctly edits 1,113 programs, whereas the SOTA baselines do not work in this setting. The results demonstrate the superiority of our pre-training task and show that the pre-trained CodeEditor is more effective in automatic code editing.