2021
DOI: 10.3390/app11114793
An Empirical Study on Software Defect Prediction Using CodeBERT Model

Abstract: Deep learning-based software defect prediction has become popular in recent years. The release of the CodeBERT model has made it possible to tackle many software engineering tasks. We propose several CodeBERT variants targeting software defect prediction: CodeBERT-NT, CodeBERT-PS, CodeBERT-PK, and CodeBERT-PT. We perform empirical studies using these models in cross-version and cross-project software defect prediction to investigate whether a neural language model like CodeBERT could improve pre…

Cited by 50 publications (24 citation statements). References 47 publications.
“…To encode file-level features, we tokenized and embedded the source code of each file using CodeBERT [16], a state-of-the-art code embedding model based on the RoBERTa architecture [40] that has been trained on millions of programming-language examples. We selected CodeBERT embeddings due to their prominence in recent literature, their promising performance in this domain [47], and their ability to make better use of small datasets [51]. We used a random forest classifier to perform classification, based on its proven success for file-level prediction in prior work [30].…”

Section: Software Vulnerability Prediction

confidence: 99%
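The pipeline quoted above (CodeBERT file embeddings fed to a random forest) can be sketched roughly as follows. This is a minimal illustration, not the cited authors' implementation: real embeddings would come from a CodeBERT checkpoint (e.g. `microsoft/codebert-base` via the Hugging Face `transformers` library), whereas here random 768-dimensional vectors stand in as placeholders so the sketch runs without downloading a model.

```python
# Hypothetical sketch of file-level defect prediction with
# CodeBERT-style embeddings and a random forest classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_files, dim = 200, 768               # 768 = CodeBERT hidden size
X = rng.normal(size=(n_files, dim))   # placeholder file embeddings
y = rng.integers(0, 2, size=n_files)  # 1 = defective file, 0 = clean

# Hold out a quarter of the files for evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
preds = clf.predict(X_te)  # one binary defect label per held-out file
```

In a real setting, each row of `X` would be obtained by tokenizing a source file, running it through CodeBERT, and pooling the token representations (commonly the `[CLS]` vector) into a single fixed-size file embedding.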
“…CodeBERT has pushed the boundaries in natural language processing and represents the state of the art for generating code documentation from snippets, as well as for retrieving code snippets given a natural language search query, across six different programming languages [41]. Moreover, it has also been applied to a variety of software engineering tasks [68].…”

Section: Threats To Validity

confidence: 99%
“…CodeBERT has pushed the boundaries in natural language processing and represents the state of the art for generating code documentation from snippets, as well as for retrieving code snippets given a natural language search query, across six different programming languages (Husain et al. 2019). Moreover, it has also been applied to a variety of software engineering tasks (Pan et al. 2021).…”

Section: Threats To Validity

confidence: 99%