Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2019)
DOI: 10.1145/3338906.3338941

The importance of accounting for real-world labelling when predicting software vulnerabilities

Abstract: Previous work on vulnerability prediction assumes that predictive models are trained with respect to perfect labelling information (i.e., including labels from future, as yet undiscovered vulnerabilities). In this paper we present results from a comprehensive empirical study of 1,898 real-world vulnerabilities reported in 74 releases of three security-critical open source systems (Linux Kernel, OpenSSL and Wireshark). Our study investigates the effectiveness of three previously proposed vulnerability prediction approaches…

Cited by 71 publications (88 citation statements) · References 40 publications
“…For most models, on most CI/CD data, using cross-validation will get more optimistic results. Different from the results found in [31], we found that using future data in the training set (cross-validation) also had the potential to make the results worse. In summary, H0,1 is rejected and we recommend time-series validation for time-series data.…”
Section: Validation Methods (RQ1) · Citation type: contrasting
Confidence: 99%
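The contrast drawn in this statement (cross-validation versus time-series validation) can be made concrete with a minimal sketch, assuming scikit-learn is available; X and y are placeholder release-ordered features and labels, not data from the cited papers. With KFold, training folds can contain releases that postdate the test fold; TimeSeriesSplit only ever trains on the past.

```python
# Minimal sketch (assumptions: scikit-learn available; X, y are placeholder
# release-ordered features/labels, not data from the cited papers).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import KFold, TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))            # per-file metrics, ordered by release date
y = (rng.random(500) < 0.1).astype(int)   # ~10% "vulnerable" labels

clf = RandomForestClassifier(random_state=0)

# Cross-validation: shuffled folds mix past and future releases, so the model is
# trained with labels it could not have known at prediction time (optimistic).
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf.fit(X[train_idx], y[train_idx])
    cv_mcc = matthews_corrcoef(y[test_idx], clf.predict(X[test_idx]))

# Time-series validation: every training index strictly precedes every test index,
# mirroring the real-world setting in which future vulnerabilities are unknown.
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    clf.fit(X[train_idx], y[train_idx])
    ts_mcc = matthews_corrcoef(y[test_idx], clf.predict(X[test_idx]))
```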
“…Tantithamthavorn et al. [52] compared different validation methods, but time-series validation was not considered. Jimenez et al. [31] indicated that introducing future labels in the training set can lead to optimistic results when predicting software vulnerabilities. This reveals the potential problems with cross-validation, but they did not compare validation methods.…”
Section: Related Work (5.1 Imbalanced Learning and Time-series) · Citation type: mentioning
Confidence: 99%
“…A recent thesis [6] evaluates the effectiveness of vulnerability prediction methods and discusses the challenges in vulnerability prediction research, such as the lack of a reliable vulnerability dataset and the lack of a replication framework for comparative analysis of existing methods. In line with this, it is shown in [28] that sufficient and accurately labelled data has a great impact on the performance of ML-based vulnerability prediction methods. Another difficulty in vulnerability prediction is the class imbalance problem, arising from the fact that the number of vulnerable code samples is far smaller than the number of healthy code samples, which makes it hard to achieve good prediction performance without raising too many false alarms.…”
Section: Related Work · Citation type: mentioning
Confidence: 88%
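The class imbalance problem mentioned in this statement is commonly mitigated by reweighting or resampling the training data. A minimal sketch, assuming scikit-learn; X_train and y_train are illustrative placeholders, not data from the cited works.

```python
# Minimal sketch: weight classes inversely to their frequency so the classifier
# is not rewarded for labelling every file as "healthy". X_train and y_train are
# hypothetical placeholders for code-metric features and vulnerability labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_train = rng.normal(size=(1000, 8))
y_train = (rng.random(1000) < 0.05).astype(int)   # ~5% vulnerable samples

# class_weight="balanced" scales weights by n_samples / (n_classes * class_count);
# the usual trade-off is higher recall on the rare class but more false alarms.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
```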
“…The chronological order of the data may impact the results of prediction models in the context of software vulnerability [41]. To address this concern, we use the defect datasets where defective files are labelled based on the affected version in the issue tracking system, instead of relying on the assumption of a 6-month post-release window.…”
Section: Threats To Validity · Citation type: mentioning
Confidence: 99%
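To illustrate the two labelling strategies contrasted in this statement, here is a hypothetical sketch; the Vulnerability record, its fields, and both helper functions are assumptions for illustration, not code from either paper.

```python
# Hypothetical sketch of two file-labelling strategies; all names are assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Vulnerability:
    file_path: str
    affected_versions: set[str]   # releases listed as affected in the issue tracker
    reported_at: datetime

def label_by_affected_version(file_path: str, version: str,
                              vulns: list[Vulnerability]) -> bool:
    """Vulnerable iff the issue tracker lists this release as affected."""
    return any(v.file_path == file_path and version in v.affected_versions
               for v in vulns)

def label_by_post_release_window(file_path: str, release_date: datetime,
                                 vulns: list[Vulnerability],
                                 window_days: int = 180) -> bool:
    """Heuristic: vulnerable iff a vulnerability touching the file is reported
    within a fixed window (e.g. ~6 months) after the release date."""
    cutoff = release_date + timedelta(days=window_days)
    return any(v.file_path == file_path and release_date <= v.reported_at <= cutoff
               for v in vulns)
```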