Automatic Feature Learning for Predicting Vulnerable Software Components

Dam, Hoa Khanh; Tran, Truyen; Pham, Trang; Ng, Shien Wee; Grundy, John; Ghose, Aditya

doi:10.1109/tse.2018.2881961

Cited by 128 publications

(72 citation statements)

References 43 publications


“…Recent studies have shown that deep learning and embedding techniques can improve the predictive accuracy of file-level defect models [6,15,62,92]. However, the important features of the embedded source code identified by a model-agnostic technique cannot be directly mapped to the risky tokens.…”

Section: Limitation (mentioning)

confidence: 99%

Predicting Defective Lines Using a Model-Agnostic Technique

Wattanakriengkrai

Thongtanunam

Tantithamthavorn

et al. 2022

IEEE Trans. Software Eng.


Defect prediction models are proposed to help a team prioritize the source code files that need Software Quality Assurance (SQA) based on the likelihood of having defects. However, developers may waste effort on the whole file while only a small fraction of its source code lines are defective. Indeed, we find that as little as 1%-3% of lines of a file are defective. Hence, in this work, we propose a novel framework (called LINE-DP) to identify defective lines using a model-agnostic technique, i.e., an Explainable AI technique that provides information about why the model makes such a prediction. Broadly speaking, our LINE-DP first builds a file-level defect model using code token features. Then, our LINE-DP uses a state-of-the-art model-agnostic technique (i.e., LIME) to identify risky tokens, i.e., code tokens that lead the file-level defect model to predict that the file will be defective. Then, the lines that contain risky tokens are predicted as defective lines. Through a case study of 32 releases of nine Java open source systems, our evaluation results show that our LINE-DP achieves an average recall of 0.61, a false alarm rate of 0.47, a top 20%LOC recall of 0.27, and an initial false alarm of 16, which are statistically better than six baseline approaches. Our evaluation shows that our LINE-DP requires an average computation time of 10 seconds including model construction and defective identification time. In addition, we find that 63% of defective lines that can be identified by our LINE-DP are related to common defects (e.g., argument change, condition change). These results suggest that our LINE-DP can effectively identify defective lines that contain common defects while requiring a smaller amount of inspection effort and a manageable computation cost. The contribution of this paper builds an important step towards line-level defect prediction by leveraging a model-agnostic technique.
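The final step described in this abstract (lines that contain risky tokens are predicted as defective) can be illustrated with a minimal sketch. It is not the LINE-DP implementation: the function name `flag_risky_lines`, the `token_scores` dictionary, the `top_k` cutoff, and the simple identifier tokenizer are all assumptions made for illustration. The scores stand in for whatever per-token importance an explainer such as LIME would return for a file that the file-level model predicts as defective.

```python
import re
from typing import Dict, List

# Hypothetical tokenizer: identifiers and keywords only (an assumption).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*")


def flag_risky_lines(source: str,
                     token_scores: Dict[str, float],
                     top_k: int = 20) -> List[int]:
    """Return 1-based numbers of lines containing at least one risky token.

    token_scores is a stand-in for explainer output: a positive score means
    the token pushes the file-level prediction towards 'defective'.
    top_k is an illustrative cutoff, not a value taken from the abstract.
    """
    # Keep only tokens that support the 'defective' prediction, strongest first.
    positive = [(tok, s) for tok, s in token_scores.items() if s > 0]
    positive.sort(key=lambda pair: pair[1], reverse=True)
    risky = {tok for tok, _ in positive[:top_k]}

    flagged = []
    for number, line in enumerate(source.splitlines(), start=1):
        # A line is flagged as soon as it shares one token with the risky set.
        if risky.intersection(TOKEN_RE.findall(line)):
            flagged.append(number)
    return flagged


example_source = "int a = read();\nif (a == null) {\n  close();\n}\n"
example_scores = {"null": 0.4, "read": 0.1, "close": -0.2}
print(flag_risky_lines(example_source, example_scores))  # prints [1, 2]
```

In this toy input, `close` has a negative score, so line 3 is not flagged even though it contains a scored token.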


“…While bug reports were taken as input in that study, in many other studies, source code is taken as input. Text mining is a highly preferred technique for obtaining features directly from source codes as in the studies [65][66][67][68][69]. Several studies [63,70] have compared text mining-based models and software metrics-based models.…”

Section: Data Mining In Vulnerability Analysis (mentioning)

confidence: 99%

“…In vulnerability studies, issue tracking systems like Bugzilla, code repositories like Github, and vulnerability databases such as NVD, CVE, and CWE have been utilized [79]. In addition to these datasets, some studies have used Android [65,68,69] or web [63,70,72] (PHP source code) datasets. In recent years, researchers have concentrated on deep learning for building binary classifiers [77], obtaining vulnerability patterns [78], and learning long-term dependencies in sequential data [68] and features directly from the source code [81].…”

Section: Data Mining In Vulnerability Analysis (mentioning)

confidence: 99%

Data Mining and Machine Learning for Software Engineering

Kiyak

2021

Data Mining - Methods, Applications and Systems


Software engineering is one of the most utilizable research areas for data mining. Developers have attempted to improve software quality by mining and analyzing software data. In any phase of the software development life cycle (SDLC), while a huge amount of data is produced, some design, security, or software problems may occur. In the early phases of software development, analyzing software data helps to handle these problems and leads to more accurate and timely delivery of software projects. Various data mining and machine learning studies have been conducted to deal with software engineering tasks such as defect prediction, effort estimation, etc. This study shows the open issues and presents related solutions and recommendations in software engineering, applying data mining and machine learning techniques.


“…Several of the reviewed papers use this form of feature extraction due to the semantic benefits. Some of the reviewed works use a single form of graphical representation for their feature extraction [14–19, 71, 73, 75, 80, 87]. (b) Code block‐based feature representation: For code block‐based feature representation, studies under this category utilise DNNs for extracting feature representations from sequential code entities such as function calls, code snippets, code gadgets, and so on. Some of the reviewed papers rely on the use of code block‐based representations of source code [67, 70, 72, 74, 76, 77, 82, 83, 85]. (c) Text‐based feature representation: For this category of feature, representations are learned directly from the source code text surface.…”

Section: Taxonomy Of Deep Learning Techniques For Source Code Vulne… (mentioning)

confidence: 99%

Literature survey of deep learning-based vulnerability analysis on source code

Semasaba

Zheng

et al. 2020

IET Software


Vulnerabilities in software source code are one of the critical issues in the realm of software code auditing. Due to their high impact, several approaches have been studied in the past few years to mitigate the damages from such vulnerabilities. Among the approaches, deep learning has gained popularity throughout the years to address such issues. In this literature survey, the authors provide an extensive review of the many works in the field of software vulnerability analysis that utilise deep learning‐based techniques. The reviewed works are systemised according to their objectives (i.e. the type of vulnerability analysis aspect), the area of focus (i.e. the focus area of the analysis), what information about source code is used (i.e. the features), and what deep learning techniques they employ (i.e. what algorithm is used to process the input and produce the output). They also study the limitations of the papers and topical trends concerning vulnerability analysis.
