Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2019)
DOI: 10.1145/3338906.3338941

The importance of accounting for real-world labelling when predicting software vulnerabilities

Abstract: Previous work on vulnerability prediction assumes that predictive models are trained with respect to perfect labelling information (i.e., including labels from future, as yet undiscovered vulnerabilities). In this paper we present results from a comprehensive empirical study of 1,898 real-world vulnerabilities reported in 74 releases of three security-critical open source systems (Linux Kernel, OpenSSL and Wireshark). Our study investigates the effectiveness of three previously proposed vulnerability prediction approaches…

Cited by 71 publications (88 citation statements) · References 40 publications
“…For most models, on most CI/CD data, using cross-validation will get more optimistic results. Different from the results found in [31], we found that using future data in the training set (cross-validation) also had the potential to make the results worse. In summary, H0,1 is rejected and we recommend time-series validation for time-series data.…”
Section: Validation Methods (RQ1) · Citation type: contrasting
Confidence: 99%
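The contrast drawn in this statement (cross-validation versus time-series validation) can be made concrete with a minimal sketch, assuming scikit-learn is available; X and y are placeholder release-ordered features and labels, not data from the cited papers. With KFold, training folds can contain releases that postdate the test fold; TimeSeriesSplit only ever trains on the past.

```python
# Minimal sketch (assumptions: scikit-learn available; X, y are placeholder
# release-ordered features/labels, not data from the cited papers).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import KFold, TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))            # per-file metrics, ordered by release date
y = (rng.random(500) < 0.1).astype(int)   # ~10% "vulnerable" labels

clf = RandomForestClassifier(random_state=0)

# Cross-validation: shuffled folds mix past and future releases, so the model is
# trained with labels it could not have known at prediction time (optimistic).
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf.fit(X[train_idx], y[train_idx])
    cv_mcc = matthews_corrcoef(y[test_idx], clf.predict(X[test_idx]))

# Time-series validation: every training index strictly precedes every test index,
# mirroring the real-world setting in which future vulnerabilities are unknown.
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    clf.fit(X[train_idx], y[train_idx])
    ts_mcc = matthews_corrcoef(y[test_idx], clf.predict(X[test_idx]))
```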
“…Tantithamthavorn et al. [52] compared different validation methods, but time-series validation was not considered. Jimenez et al. [31] indicated that introducing future labels in the training set can lead to optimistic results when predicting software vulnerabilities. This reveals the potential problems with cross-validation, but they did not compare validation methods.…”
Section: Related Work (5.1 Imbalanced Learning and Time-series) · Citation type: mentioning
Confidence: 99%
“…A recent thesis [6] evaluates the effectiveness of vulnerability prediction methods and discusses the challenges in vulnerability prediction research, such as the lack of a reliable vulnerability dataset and the lack of a replication framework for comparative analysis of existing methods. In line with this, it is shown in [28] that sufficient and accurately labelled data has a great impact on the performance of ML-based vulnerability prediction methods. Another difficulty in vulnerability prediction is the class imbalance problem, arising from the fact that the number of vulnerable code samples is far smaller than the number of healthy code samples, which makes it hard to achieve good prediction performance without raising too many false alarms.…”
Section: Related Work · Citation type: mentioning
Confidence: 88%
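The class imbalance problem mentioned in this statement is commonly mitigated by reweighting or resampling the training data. A minimal sketch, assuming scikit-learn; X_train and y_train are illustrative placeholders, not data from the cited works.

```python
# Minimal sketch: weight classes inversely to their frequency so the classifier
# is not rewarded for labelling every file as "healthy". X_train and y_train are
# hypothetical placeholders for code-metric features and vulnerability labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_train = rng.normal(size=(1000, 8))
y_train = (rng.random(1000) < 0.05).astype(int)   # ~5% vulnerable samples

# class_weight="balanced" scales weights by n_samples / (n_classes * class_count);
# the usual trade-off is higher recall on the rare class but more false alarms.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
```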
“…The chronological order of the data may impact the results of prediction models in the context of software vulnerability [41]. To address this concern, we use the defect datasets where defective files are labelled based on the affected version in the issue tracking system, instead of relying on the assumption of a 6-month post-release window.…”
Section: Threats To Validity · Citation type: mentioning
Confidence: 99%
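To illustrate the two labelling strategies contrasted in this statement, here is a hypothetical sketch; the Vulnerability record, its fields, and both helper functions are assumptions for illustration, not code from either paper.

```python
# Hypothetical sketch of two file-labelling strategies; all names are assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Vulnerability:
    file_path: str
    affected_versions: set[str]   # releases listed as affected in the issue tracker
    reported_at: datetime

def label_by_affected_version(file_path: str, version: str,
                              vulns: list[Vulnerability]) -> bool:
    """Vulnerable iff the issue tracker lists this release as affected."""
    return any(v.file_path == file_path and version in v.affected_versions
               for v in vulns)

def label_by_post_release_window(file_path: str, release_date: datetime,
                                 vulns: list[Vulnerability],
                                 window_days: int = 180) -> bool:
    """Heuristic: vulnerable iff a vulnerability touching the file is reported
    within a fixed window (e.g. ~6 months) after the release date."""
    cutoff = release_date + timedelta(days=window_days)
    return any(v.file_path == file_path and release_date <= v.reported_at <= cutoff
               for v in vulns)
```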