2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR) 2021
DOI: 10.1109/msr52588.2021.00016

Automatic Part-of-Speech Tagging for Security Vulnerability Descriptions

Abstract: In this paper, we study the problem of part-of-speech (POS) tagging for security vulnerability descriptions (SVD). In contrast to newswire articles, SVD often contains a high-level natural language description of the text composed of mixed language studded with codes, domain-specific jargon, vague language, and abbreviations. Moreover, training data dedicated to security vulnerability research is not widely available. Existing neural network-based POS tagging has often relied on manually annotated training dat…

Cited by 10 publications (3 citation statements)
References 43 publications
“…For the training data, in some research, authors have used the manually trained corpus in the open domain [38,39] or the closed domain [40,41] for their investigation. Kumar [42] proposed an approach in part-of-speech (POS) tagging for the open domain, considering their defined corpus with 77,860 tokens for training and 7544 for testing.…”
Section: State Of the Art (mentioning)
confidence: 99%
“…Besides version products/names of SVs, Gonzalez et al [66] used a majority vote of different ML models (e.g., SVM and Random forest) to extract the 19 entities of Vulnerability Description Ontology (VDO) [137] from SV descriptions to check the consistency of these descriptions based on the guidelines of VDO. Since 2020, there has been a trend in using DL models (e.g., Bi-LSTM, CNNs or BERT [42]/ELMo [148]) to extract different information from SV descriptions including required elements for generating MulVal [143] attack rules [16] or SV types/root cause, attack type/vector [69], Common Product Enumeration (CPE) [126] for standardizing names of vulnerable vendors/products/versions [190], part-of-speech [203] and relevant entities (e.g., vulnerable products, attack type, root cause) from ExploitDB to generate SV descriptions [177]. BERT models [42], pre-trained on general text (e.g., Wikipedia pages [60] or PTB corpus [117]) and fine-tuned on SV text, have also been increasingly used to address the data scarcity/imbalance for the retrieval tasks.…”
Section: Vulnerability Information Retrieval (mentioning)
confidence: 99%
“…And in contrast to text in other domains, security vulnerability descriptions (SVDS) typically contain high-level natural language descriptions of text, composed of mixed languages that may also contain code, domain-specific vocabulary, vague language, and abbreviations. In addition, training data specifically used for security vulnerability research is not extensive [10]. In particular, the number of vulnerability reports in Chinese is relatively small at present.…”
Section: Introduction (mentioning)
confidence: 99%