As the popularity of text-based source code search and analysis grows, the use of stemmers to strip suffixes has increased. Although widely investigated in the information retrieval community, the comparative effectiveness of stemmers in the domain of software is relatively unknown. In this paper, we investigate which of the well-known stemmers perform best in the domain of Java software for concern location and bug localization. For these two problems, we evaluate the use of stemming on over 500 search tasks for six different Java applications. Using MAP and Rank Measure, we conducted an overall qualitative study and a query-by-query quantitative study of the impact of stemming on retrieval effectiveness. As one might expect, our contribution demonstrates that how stemming affects retrieval performance is mediated by other factors, such as the use of tf-idf to filter commonly occurring terms and the precise nature of the queries. Specifically, we find that the extent to which stemming improves the retrieval performance relates to the degree of natural language content in a query.
From the standpoint of retrieval from large software libraries for the purpose of bug localization, we compare five generic text models and certain composite variations thereof. The generic models are: the Unigram Model (UM), the Vector Space Model (VSM), the Latent Semantic Analysis Model (LSA), the Latent Dirichlet Allocation Model (LDA), and the Cluster Based Document Model (CBDM). The task is to locate the files that are relevant to a bug reported in the form of a textual description by a software developer. We use for our study iBUGS, a benchmarked bug localization dataset with 75 KLOC and a large number of bugs (291). A major conclusion of our comparative study is that simple text models such as UM and VSM are more effective at correctly retrieving the relevant files from a library as compared to the more sophisticated models such as LDA. The retrieval effectiveness for the various models was measured using the following two metrics: (1) Mean Average Precision; and (2) Rank-based metrics. Using the SCORE metric, we also compare the retrieval effectiveness of the models in our study with some other bug localization tools.
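For readers unfamiliar with the first metric named above, the following is a minimal sketch of how Mean Average Precision is conventionally computed over a set of bug queries. It is a textbook formulation, not code from the paper; the function names and the file names in the usage note are illustrative assumptions.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average precision for one query: the mean of precision@k taken at
    each rank k where a relevant file appears, divided by the total number
    of relevant files (so unretrieved relevant files count as misses)."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, path in enumerate(ranked, start=1):
        if path in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Iterable[Tuple[Sequence[str], Set[str]]]
) -> float:
    """MAP: the arithmetic mean of average precision over all queries."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

As a usage example, a query whose ranked list is `["A.java", "B.java", "C.java"]` with relevant files `{"A.java", "C.java"}` has average precision (1/1 + 2/3) / 2, and MAP averages such values across all bug reports in the benchmark.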
Information Retrieval (IR) based bug localization techniques use bug reports to query a software repository and retrieve relevant source files. These techniques index the source files in the software repository and train a model, which is then queried for retrieval purposes. Much of the current research is focused on improving the retrieval effectiveness of these methods. However, little consideration has been given to the efficiency of such approaches for software repositories that are constantly evolving. As the software repository evolves, the index creation and model learning have to be repeated to ensure accuracy of retrieval for each new bug. In doing so, the query latency may be unreasonably high; moreover, re-computing the index and the model for files that did not change is computationally redundant. We propose an incremental update framework to continuously update the index and the model using the changes made at each commit. We demonstrate that the same retrieval accuracy can be achieved in a fraction of the time needed by current approaches. Our results are based on two basic IR modeling techniques: the Vector Space Model (VSM) and the Smoothed Unigram Model (SUM). The dataset we used in our validation experiments was created by tracking the commit history of the AspectJ and JodaTime software libraries over a span of 10 years.
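The general idea of updating an index per commit, rather than rebuilding it, can be sketched as follows. This is not the framework from the paper: the class `IncrementalIndex`, its method names, the smoothed idf formula, and the un-normalized tf-idf scoring are all assumptions chosen to keep the example short. The point it illustrates is that only the files touched by a commit have their term counts and document frequencies adjusted.

```python
import math
import re
from collections import Counter


class IncrementalIndex:
    """Toy tf-idf index (hypothetical) that is updated commit by commit."""

    def __init__(self):
        self.tf = {}         # file path -> Counter of term frequencies
        self.df = Counter()  # term -> number of files containing the term

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())

    def remove(self, path):
        """Drop a file and undo its contribution to document frequencies."""
        old = self.tf.pop(path, None)
        if old is None:
            return
        for term in old:
            self.df[term] -= 1
            if self.df[term] == 0:
                del self.df[term]

    def upsert(self, path, text):
        """Add or replace one file without touching any other file."""
        self.remove(path)
        counts = Counter(self.tokenize(text))
        self.tf[path] = counts
        for term in counts:
            self.df[term] += 1

    def apply_commit(self, changed, deleted=()):
        """changed: {path: new contents}; deleted: iterable of paths."""
        for path in deleted:
            self.remove(path)
        for path, text in changed.items():
            self.upsert(path, text)

    def idf(self, term):
        # Smoothed idf (an assumption); computed lazily from current counts.
        n = len(self.tf)
        return math.log((1 + n) / (1 + self.df.get(term, 0))) + 1.0

    def score(self, query):
        """Rank files by an un-normalized tf-idf dot product with the query."""
        q = Counter(self.tokenize(query))
        scores = {}
        for path, counts in self.tf.items():
            s = sum(q[t] * counts[t] * self.idf(t) ** 2 for t in q if t in counts)
            if s > 0:
                scores[path] = s
        return sorted(scores, key=scores.get, reverse=True)
```

Because idf is derived from the stored document frequencies at query time, a commit that changes one file costs work proportional to that file's vocabulary, not to the size of the repository.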
This retrospective on our 2011 MSR publication starts with the research milieu that led to the work reported in our paper. We briefly review the competing ideas of a decade ago that could be applied to solving the problem of identifying the files in a software library related to a query. We were especially interested in finding out if the more complex text retrieval methods of that time would be effective in the software context. A surprising conclusion of our paper was that the reality was exactly the opposite: the more traditional simpler methods outperformed the complex methods. In addition to this surprising result, our paper was also the first to report what was considered at that time a large-scale quantitative evaluation of the IR-based approaches to automatic bug localization. Over the years, such quantitative evaluations have become the norm. We believe that these contributions were largely responsible for the popularity of this paper in the research literature.
The problem of bug localization is to identify the source files related to a bug in a software repository. Information Retrieval (IR) based approaches create an index of the source files and learn a model which is then queried with a bug for the relevant files. In spite of the advances in these tools, the current approaches do not take into consideration the dynamic nature of software repositories. With the traditional IR based approaches to bug localization, the model parameters must be recalculated for each change to a repository. In contrast, this paper presents an incremental framework to update the model parameters of the Latent Semantic Analysis (LSA) model as the data evolves. We compare two state-of-the-art incremental SVD update techniques for LSA with respect to the retrieval accuracy and the time performance. The dataset we used in our validation experiments was created from mining 10 years of version history of AspectJ and JodaTime software libraries.