Toward Validation of Textual Information Retrieval Techniques for Software Weaknesses

Ruohonen, Jukka; Leppänen, Ville

doi:10.1007/978-3-319-99133-7_22

Cited by 13 publications

(9 citation statements)

References 31 publications


“…The following cut JSON (JavaScript Object Notation) excerpt can be used to illustrate a rather typical entry in Safety DB: As can be seen, the advisory field provides a brief textual description for each vulnerability archived to the database. These descriptions follow the typically terse prose used for describing vulnerabilities [12]. (It is also worth remarking that the textual advisories in Safety DB are mostly plagiarized directly from NVD and related sources.)…”

Section: A Sources (mentioning)

confidence: 99%

“…These limitations have prompted a new branch of research for examining vulnerabilities in software repositories. While packages used in Linux distributions have been a common target [10], the more recent research has focused on language-specific repositories such as npm for JavaScript [11], [12]. This is the research domain to which this paper contributes by presenting the supposedly first study on vulnerabilities in the Python's PyPI repository and advancing the understanding on release-based time series analysis of software vulnerabilities.…”

Section: Introduction (mentioning)

confidence: 99%


An Empirical Analysis of Vulnerabilities in Python Packages for Web Applications

Ruohonen

2018

2018 9th International Workshop on Empirical Software Engineering in Practice (IWESEP)

Self Cite


This paper examines software vulnerabilities in common Python packages used particularly for web development. The empirical dataset is based on the PyPI package repository and the so-called Safety DB used to track vulnerabilities in selected packages within the repository. The methodological approach builds on a release-based time series analysis of the conditional probabilities for the releases of the packages to be vulnerable. According to the results, many of the Python vulnerabilities observed seem to be only modestly severe; input validation and cross-site scripting have been the most typical vulnerabilities. In terms of the time series analysis based on the release histories, only the recent past is observed to be relevant for statistical predictions; the classical Markov property holds.
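The conditional probabilities mentioned in the abstract can be illustrated with a minimal sketch. It assumes each release of a package is labeled 1 (vulnerable) or 0 (not vulnerable) and estimates first-order transition probabilities from the release sequence. The function name and the labeling scheme are hypothetical, and this is not the authors' estimation procedure.

```python
from collections import Counter


def transition_probabilities(states):
    """Estimate P(next state | current state) from a sequence of 0/1 labels.

    Hypothetical helper: each label marks one release as vulnerable (1)
    or not vulnerable (0), in release order.
    """
    # Count consecutive (current, next) pairs along the release history.
    pair_counts = Counter(zip(states, states[1:]))
    # Count how often each state occurs as the "current" state.
    from_counts = Counter(states[:-1])
    return {
        (cur, nxt): count / from_counts[cur]
        for (cur, nxt), count in pair_counts.items()
    }


# Made-up release history for illustration only.
history = [0, 0, 1, 1, 1, 0]
probabilities = transition_probabilities(history)
```

Under a first-order Markov assumption, only the most recent release label is needed to look up the probability that the next release is vulnerable.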


“…In addition to the stopwords supplied in the library, the twelve most frequent tokens were used as custom excluded stopwords: data, article, personal, protection, processing, company, authority, regulation, information, case, art, and page. After this pre-processing, the token-based term frequency (TF) and term frequency inverse document frequency (TF-IDF) were calculated from the whole corpus constructed (for the exact formulas used see, e.g., [19]). These common information retrieval statistics are used for evaluating the other part in Q2.…”

Section: Methods (mentioning)

confidence: 99%
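The pre-processing and weighting described in the statement above can be sketched as follows. The twelve custom stopwords are taken from the quote; everything else is an assumption: the tokenizer, the function names, and the particular TF-IDF variant (raw counts multiplied by ln(N/df)). The quoted paper defers to its reference [19] for the exact formulas, and the library-supplied stopwords it mentions are not reproduced here.

```python
import math
from collections import Counter

# The twelve custom stopwords listed in the quoted statement.
CUSTOM_STOPWORDS = {
    "data", "article", "personal", "protection", "processing", "company",
    "authority", "regulation", "information", "case", "art", "page",
}


def tokenize(text, stopwords=CUSTOM_STOPWORDS):
    """Lowercase, split on non-letters, and drop stopwords (assumed tokenizer)."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return [word for word in cleaned.split() if word not in stopwords]


def tf_idf(documents):
    """Return one {token: weight} dict per document.

    Assumed variant: tf is the raw token count, idf is ln(N / df).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    return [
        {tok: count * math.log(n_docs / df[tok])
         for tok, count in Counter(toks).items()}
        for toks in tokenized
    ]


# Made-up two-document corpus for illustration only.
corpus = [
    "The company processed personal data without consent",
    "Consent was missing in the case",
]
weights = tf_idf(corpus)
```

With this variant, a token that occurs in every document gets a weight of zero, which is one reason smoothed IDF variants are often preferred in practice.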

Predicting the Amount of GDPR Fines

Ruohonen, Hjerppe

2020

Preprint

Self Cite


The General Data Protection Regulation (GDPR) was enforced in 2018. After this enforcement, many fines have already been imposed by national data protection authorities in the European Union (EU). This paper examines the individual GDPR articles referenced in the enforcement decisions, as well as predicts the amount of enforcement fines with available meta-data and text mining features extracted from the enforcement decision documents. According to the results, articles related to the general principles, lawfulness, and information security have been the most frequently referenced ones. Although the amount of fines imposed vary across the articles referenced, these three particular articles do not stand out. Furthermore, good predictions are attainable even with simple machine learning techniques for regression analysis. Basic meta-data (such as the articles referenced and the country of origin) yields slightly better performance compared to the text mining features.


“…Another study defined a framework to prioritize vulnerabilities [19]. Several studies have focused on mining methods and information retrieval for a security knowledge repository [20], [21], [22], [23], [24]. These papers mined each repository using their relationships.…”

Section: Related Work (mentioning)

confidence: 99%

Tracing CAPEC Attack Patterns from CVE Vulnerability Information using Natural Language Processing Technique

Kanakogi, Washizaki, Fukazawa et al.

2021

Proceedings of the Annual Hawaii International Conference on System Sciences


To effectively respond to vulnerabilities, information must not only be collected efficiently and quickly but also the vulnerability and the attack techniques must be understood. A security knowledge repository can collect such information. The Common Vulnerabilities and Exposures (CVE) provides known vulnerabilities of products, while the Common Attack Pattern Enumeration and Classification (CAPEC) stores attack patterns, which are descriptions of the common attributes and approaches employed by adversaries to exploit known weaknesses. Because the information in these two repositories is not directly related, identifying the related CAPEC attack information from the CVE vulnerability information is challenging. One proposed method traces some related CAPEC-ID from CVE-ID through Common Weakness Enumeration (CWE). However, it is not applicable to all patterns. Here, we propose a method to automatically trace the related CAPEC-IDs from CVE-ID using TF-IDF and Doc2Vec. Additionally, we experimentally confirm that TF-IDF is more accurate than Doc2vec.
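The TF-IDF tracing idea in the abstract can be illustrated with a minimal sketch that ranks attack-pattern descriptions by their similarity to a vulnerability description. The abstract does not state the similarity measure, so cosine similarity is an assumption here, as are the smoothed IDF formula, the whitespace tokenizer, the function names, and the placeholder identifiers and texts. It is not the authors' implementation.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Build sparse TF-IDF vectors (dicts) for a list of texts."""
    tokenized = [text.lower().split() for text in texts]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed idf (an assumption) so shared terms keep a nonzero weight.
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}
    return [
        {tok: count * idf[tok] for tok, count in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_patterns(query, patterns):
    """Rank pattern ids by similarity to the query text, best match first."""
    ids = list(patterns)
    vectors = tfidf_vectors([query] + [patterns[i] for i in ids])
    scored = [(i, cosine(vectors[0], vec)) for i, vec in zip(ids, vectors[1:])]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder identifiers and descriptions, not real CVE or CAPEC entries.
patterns = {
    "PATTERN-A": "sql injection through crafted input",
    "PATTERN-B": "buffer overflow in memory copy",
}
ranked = rank_patterns("sql injection in login form", patterns)
```

In this toy example the query shares two terms with the first description and one with the second, so the first is ranked higher.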
