Predicting semantically linkable knowledge in developer online forums via convolutional neural network

Xu, Bowen; Ye, Deheng; Xing, Zhenchang; Xia, Xin; Chen, Guibin; Li, Shanping

doi:10.1145/2970276.2970357

Cited by 138 publications

(101 citation statements)

References 34 publications



“…[7], [24] has illustrated the effectiveness of neural language models learned using the Word2Vec family of models [20]. The Word2Vec group of models uses a shallow neural network trained to predict the current word given surrounding context (i.e., the continuous bag-of-words CBOW model) or the surrounding context given the current word (i.e., the skip-gram model).…”

Section: A Learning Semantic Word Embeddings (mentioning)

confidence: 99%
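The CBOW/skip-gram distinction in the statement above can be made concrete with a minimal sketch. It only builds the training examples each objective would see and does not train a network. The helper names (`context_window`, `cbow_examples`, `skipgram_examples`), the `window` parameter, and the sample tokens are hypothetical illustrations, not taken from any of the cited papers.

```python
from typing import List, Tuple


def context_window(tokens: List[str], i: int, window: int) -> List[str]:
    """Return up to `window` tokens on each side of position i (excluding i)."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW: the input is the surrounding context, the target is the current word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram: the input is the current word, the target is one context word."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


if __name__ == "__main__":
    sample = ["how", "to", "parse", "json"]  # hypothetical tokenized post title
    print(cbow_examples(sample, window=1))
    print(skipgram_examples(sample, window=1))
```

With the same window, CBOW yields one example per token position, while skip-gram yields one example per (word, context word) pair.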

“…Chen et al [7] utilize neural word embeddings and a CNN to help improve the proficiency of retrieving relevant results on Stack Overflow when the queries are posed in a language other than English. Xu et al [24] utilize a similar neural language model and CNN to link similar pieces of information in Stack Overflow posts. Gu et al [9] use an RNN encoder-decoder model to help improve the effectiveness of searching for API call sequences using natural language queries.…”

Section: B Applications Of DL To Software Engineering Tasks (mentioning)

confidence: 99%


Learning to Identify Security-Related Issues Using Convolutional Neural Networks

Palacio, McCrystal, Moran, et al. 2019

2019 IEEE International Conference on Software Maintenance and Evolution (ICSME)


Software security is becoming a high priority for both large companies and start-ups alike due to the increasing potential for harm that vulnerabilities and breaches carry with them. However, attaining robust security assurance while delivering features requires a precarious balancing act in the context of agile development practices. One path forward to help aid development teams in securing their software products is through the design and development of security-focused automation. Ergo, we present a novel approach, called SecureReqNet, for automatically identifying whether issues in software issue tracking systems describe security-related content. Our approach consists of a two-phase neural net architecture that operates purely on the natural language descriptions of issues. The first phase of our approach learns high-dimensional word embeddings from hundreds of thousands of vulnerability descriptions listed in the CVE database and issue descriptions extracted from open source projects. The second phase then utilizes the semantic ontology represented by these embeddings to train a convolutional neural network capable of predicting whether a given issue is security-related. We evaluated SecureReqNet by applying it to identify security-related issues from a dataset of thousands of issues mined from popular projects on GitLab and GitHub. In addition, we also applied our approach to identify security-related requirements from a commercial software project developed by a major telecommunication company. Our preliminary results are encouraging, with SecureReqNet achieving an accuracy of 96% on open source issues and 71.6% on industrial requirements.


“…Xu et al [39] use CNN to semantically link together knowledge units from StackOverflow. Their approach focuses on predicting several classes of relatedness (e.g., duplicate, related information).…”

Section: Deep Learning In Software Engineering (mentioning)

confidence: 99%

On using machine learning to identify knowledge in API reference documentation

Fucci, Mollaalizadehbahnemiri, Maalej 2019

Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of


Using API reference documentation like JavaDoc is an integral part of software development. Previous research introduced a grounded taxonomy that organizes API documentation knowledge in 12 types, including knowledge about the Functionality, Structure, and Quality of an API. We study how well modern text classification approaches can automatically identify documentation containing specific knowledge types. We compared conventional machine learning (k-NN and SVM) and deep learning approaches trained on manually-annotated Java and .NET API documentation (n = 5,574). When classifying the knowledge types individually (i.e., multiple binary classifiers) the best AUPRC was up to 87%. The deep learning and SVM classifiers seem complementary. For four knowledge types (Concept, Control, Pattern, and Non-Information), SVM clearly outperforms deep learning which, on the other hand, is more accurate for identifying the remaining types. When considering multiple knowledge types at once (i.e., multi-label classification) deep learning outperforms naïve baselines and traditional machine learning achieving a MacroAUC up to 79%. We also compared classifiers using embeddings pre-trained on generic text corpora and StackOverflow but did not observe significant improvements. Finally, to assess the generalizability of the classifiers, we re-tested them on a different, unseen Python documentation dataset. Classifiers for Functionality, Concept, Purpose, Pattern, and Directive seem to generalize from Java and .NET to Python documentation. The accuracy related to the remaining types seems API-specific. We discuss our results and how they inform the development of tools for supporting developers sharing and accessing API knowledge.

Published article: https://doi.org/10.1145/3338906.3338943

Software developers reuse software libraries and frameworks through Application Programming Interfaces (APIs). They often rely on reference documentation to identify which API elements are relevant for the task at hand, how the API can be instantiated, configured, and combined [1]. Compared to other knowledge sources, such as tutorials and Q&A portals, reference documentation like JavaDoc and PyDoc are considered the official API technical documentation. They provide detailed and fundamental information about API elements, components, operations, and structures [2, 3].

Figure 1: A reference documentation page in the JDK API annotated with the knowledge types it contains.

As API documentation can be thousands of pages long [4,5], accessing relevant knowledge can be tedious and time-consuming [1]. Moreover, the information necessary to accomplish a task can be scattered across the documentation pages of multiple elements, such as classes, methods, and properties. Thus, developers try to use other sources to fulfill their information needs. For example, although the Java Development Kit (JDK) API documentation contains more than 7,000 pages, as of early 2019, there are more than 3 million StackOverflow posts tagged as java.

Over the last decade, sof...


“…Most previous work in mining community Q&A sites has focused on: (i) assessing the quality of questions and answers (Ponzanelli et al 2014;Xia et al 2016;Roy et al 2017); (ii) understanding how software developers interact with each other on Q&A sites (Treude et al 2011); (iii) providing empirical evidence on how to write good questions and answers (Bosu et al 2013;Calefato et al 2018); the impact of sentiment on getting an answer accepted (Calefato et al 2015); (iv) the role played by social cues on the perceived quality of an answer (Hart and Sarma 2014); (v) the topics discussed by developers (Bajaj et al 2014;Barua et al 2012); (vi) retrieving semantically linked questions (Xu et al 2016a;Xu et al 2016b); and (vii) summarizing answers (Xu et al 2017). Table 15 summarizes the prior work reviewed next, which is strictly related to best-answer prediction for technical help requests.…”

Section: Related Work (mentioning)

confidence: 99%

An empirical assessment of best-answer prediction models in technical Q&A sites

2018


Technical Q&A sites have become essential for software engineers as they constantly seek help from other experts to solve their work problems. Despite their success, many questions remain unresolved, sometimes because the asker does not acknowledge any helpful answer. In these cases, an information seeker can only browse all the answers within a question thread to assess their quality as potential solutions. We approach this time-consuming problem as a binary-classification task where a best-answer prediction model is built to identify the accepted answer among those within a resolved question thread, and the candidate solutions to those questions that have received answers but are still unresolved. In this paper, we report on a study aimed at assessing 26 best-answer prediction models in two steps. First, we study how models perform when predicting best answers in Stack Overflow, the most popular Q&A site for software engineers. Then, we assess performance in a cross-platform setting where the prediction models are trained on Stack Overflow and tested on other technical Q&A sites. Our findings show that the choice of the classifier and automated parameter tuning have a large impact on the prediction of the best answer. We also demonstrate that our approach to the best-answer prediction problem is generalizable across technical Q&A sites. Finally, we provide practical recommendations to Q&A platform designers to curate and preserve the crowdsourced knowledge shared through these sites.
