Detecting Duplicate Bug Report Using Character N-Gram-Based Features

Sureka, Ashish; Jalote, Pankaj

doi:10.1109/apsec.2010.49

Cited by 115 publications

(56 citation statements)

References 8 publications


“…Their experimental results showed that about two-thirds of the duplicates could be found using natural language processing (NLP) techniques. Sureka and Jalote also proposed a method that used a character N-gram-based model for duplicate bug report identification [15]. This approach differed from word-based duplicate bug report identification methods because they investigated the usefulness of low-level features based on characters, which have many advantages such as natural language independence and robustness against noisy data.…”

Section: Duplicate Detection and Classification Of Bug Reports

mentioning

confidence: 99%

A Novel Technique for Duplicate Detection and Classification of Bug Reports

Zhang

Lee

2014

IEICE Trans. Inf. & Syst.


SUMMARY: Software products are increasingly complex, so it is becoming more difficult to find and correct bugs in large programs. Software developers rely on bug reports to fix bugs; thus, bug-tracking tools have been introduced to allow developers to upload, manage, and comment on bug reports to guide corrective software maintenance. However, the very high frequency of duplicate bug reports means that the triagers who help software developers in eliminating bugs must allocate large amounts of time and effort to the identification and analysis of these bug reports. In addition, classifying bug reports can help triagers arrange bugs in categories for the fixers who have more experience for resolving historical bugs in the same category. Unfortunately, due to a large number of submitted bug reports every day, the manual classification for these bug reports increases the triagers' workload. To resolve these problems, in this study, we develop a novel technique for automatic duplicate detection and classification of bug reports, which reduces the time and effort consumed by triagers for bug fixing. Our novel technique uses a support vector machine to check whether a new bug report is a duplicate. The concept profile is also used to classify the bug reports into related categories in a taxonomic tree. Finally, we conduct experiments that demonstrate the feasibility of our proposed approach using bug reports extracted from the large-scale open source project Mozilla.
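
The character N-gram features that the citation statement above attributes to Sureka and Jalote can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of n = 3, and the use of Jaccard overlap as the similarity score are all assumptions made for illustration. It only shows why character-level features need no language-specific tokenizer and tolerate small spelling differences.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased,
    whitespace-normalized string (hypothetical helper)."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def ngram_similarity(report_a, report_b, n=3):
    """Similarity of two bug report texts based on shared character n-grams.
    Jaccard overlap is an assumed stand-in for whatever scoring a real
    system would use."""
    return jaccard(char_ngrams(report_a, n), char_ngrams(report_b, n))


if __name__ == "__main__":
    new_report = "Crashes when opening files"
    candidates = ["Crash when opening file", "Toolbar icon misaligned"]
    for candidate in candidates:
        print(candidate, ngram_similarity(new_report, candidate))
```

Because "Crashes" and "Crash" share most of their trigrams, the near-duplicate scores higher than the unrelated report even though the word forms differ.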


“…This comment is often marked by developer. According to previous studies, although centroid-based approaches bring many advantages [6], however, they also face much serious with the problem of inductive bias or model misfit [10,13]. Centroid-based approaches are more susceptible to model misfit because of its assumption that a document should be assigned to a particular class when the similarity of this document and the class is the largest [12].…”

Section: Duplication Detection Design

mentioning

confidence: 99%

“…Moreover, the SVM model also need to retrain when a new bug report comes, this can cause a great cost in the detection process. Also in the same year 2010, feature extraction method based n-gram of Ashish Sureka and Pankaj Jalote [10] were proposed and have improved the performance of duplicate detection on bug reports. The method based observation of bug report characteristics which contain many code compound words.…”

mentioning

confidence: 99%

Improving Detection Performance of Duplicate Bug Reports Using Extended Centroid Features

Phuc

Nam

2014

International Journal of Advanced Research in Computer and Comm


According to recent work, detection on duplicate bug reports has received much attention. One of the reasons is that duplicate bug reports may consume time of bug triagers and software developers. In previous studies, many schemes have been developed for using text mining techniques or using the information retrieval and natural language processing techniques. In this paper, we propose a method to improve centroid characteristics by adjusting centroids with better initial values than based on Class-Feature-Centroid (CFC) [12]. With the effectiveness of CFC, the centroid-based approach can obtain further improvements for detection performance. The method includes two steps. First, we extract inter-class and inner-class term indices from the corpus. Second, we enhance centroid calculation based on class features. Moreover, for similarity measure we also adapt the calculation of the traditional cosine similarity by denormalized cosine measure which is also used in [12].
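
The general idea of a centroid-based approach, as described in the abstract above, can be sketched as follows. This is a plain nearest-centroid classifier with ordinary cosine similarity over word counts, not the Class-Feature-Centroid weighting or the denormalized cosine measure the abstract refers to; all function names and the toy labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term counts for a report (assumed representation)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Standard cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_centroids(labeled_reports):
    """Average the term vectors of each class to get one centroid per class."""
    sums = defaultdict(Counter)
    counts = Counter()
    for label, text in labeled_reports:
        sums[label].update(vectorize(text))
        counts[label] += 1
    return {
        label: {term: c / counts[label] for term, c in vec.items()}
        for label, vec in sums.items()
    }


def nearest_centroid(text, centroids):
    """Assign the report to the class whose centroid is most similar."""
    query = vectorize(text)
    return max(centroids, key=lambda label: cosine(query, centroids[label]))


if __name__ == "__main__":
    training = [
        ("crash", "app crash on startup"),
        ("crash", "crash when opening file"),
        ("ui", "button misaligned in toolbar"),
        ("ui", "toolbar icon wrong color"),
    ]
    centroids = build_centroids(training)
    print(nearest_centroid("crash on opening", centroids))
```

The final assignment rule, picking the class with the largest similarity, is the assumption that the citation statement above identifies as a source of model misfit.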


“…Bug report deduplication is the querying of similar bug reports in order to cluster and group bug reports that report the same issue. Common tools in bug report deduplication are NLP Runeson et al (2007), machine-learning Bettenburg et al (2008); Sun et al (2010); ; Lazar et al (2014), information retrieval Sun et al (2011);Sureka and Jalote (2010), topic analysis Alipour (2013); ; Klein et al (2014). Zhang et al Zhang et al (2015) have applied typical bug-deduplication technology to StackOverflow duplicate question detection.…”

Section: Bug Report Deduplication

mentioning

confidence: 99%

Stopping duplicate bug reports before they start with Continuous Querying for bug reports

Hindle

2016

Preprint


Bug deduplication is a hot topic in software engineering information retrieval research, but it is often not deployed. Typically to de-duplicate bug reports developers rely upon the search capabilities of the bug report software they employ, such as Bugzilla, Jira, or Github Issues. These search capabilities range from simple SQL string search to IR-based word indexing methods employed by search engines. Yet too often these searches do very little to stop the creation of duplicate bug reports. Some bug trackers have more than 10% of their bug reports marked as duplicate. Perhaps these bug tracker search engines are not enough? In this paper we propose a method of attempting to prevent duplicate bug reports before they start: continuous querying. That is as the bug reporter types in their bug report their text is used to query the bug database to find duplicate or related bug reports. This continuous querying allows the reporter to be alerted to duplicate bug reports as they report the bug, rather than formulating queries to find the duplicate bug report. Thus this work ushers in a new way of evaluating bug report deduplication techniques, as well as a new kind of bug deduplication task. We show that simple IR measures show some promise for addressing this problem but also that further research is needed to refine this novel process that is integrate-able into modern bug report systems.
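
The continuous querying idea in the abstract above, where the partially typed report is repeatedly used as a query, might be sketched like this. The abstract does not specify the retrieval measure, so simple word overlap is used as an assumed stand-in, and the function names and the in-memory `reports` dictionary are hypothetical.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def query(partial_text, reports, top_k=3):
    """Rank stored reports by word overlap with the text typed so far.
    Ties are broken by report id so the ordering is deterministic."""
    typed = tokens(partial_text)
    scored = []
    for report_id, text in reports.items():
        overlap = len(typed & tokens(text))
        if overlap:
            scored.append((overlap, report_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [report_id for _, report_id in scored[:top_k]]


def continuous_query(full_text, reports, top_k=3):
    """Simulate a reporter typing word by word, re-querying after each word."""
    words = full_text.split()
    for i in range(1, len(words) + 1):
        partial = " ".join(words[:i])
        yield partial, query(partial, reports, top_k)


if __name__ == "__main__":
    reports = {
        1: "crash when saving file",
        2: "toolbar icon missing",
        3: "file save dialog freezes",
    }
    for partial, hits in continuous_query("crash saving file", reports):
        print(repr(partial), hits)
```

Each yielded pair shows which existing reports would be surfaced at that point in typing, which is the kind of per-prefix result the abstract suggests evaluating.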

