The performance of any search engine relies heavily on its Web crawler. Web crawlers are programs that retrieve webpages from the Web by following hyperlinks. These webpages are indexed by a search engine and can be retrieved by a user query. In the area of Web crawling, we still lack an exhaustive study that covers all crawling techniques. This study follows the guidelines of systematic literature review and applies them to the field of Web crawling. We used the standard procedure for carrying out a systematic literature review on 248 studies selected from a total of 1488 articles published in 12 leading journals and other premier conferences and workshops. Existing literature about the Web crawler is classified into different key subareas. Each subarea is further divided according to the techniques being used. We analyzed the distribution of articles using multiple criteria and drew conclusions. Various studies that use open source Web crawlers are also reported. We highlight future areas of research. We call for increased awareness of the various subfields of Web crawling and identify how techniques from other domains can be used for crawling the Web. Limitations and recommendations for future work are also discussed. WIREs Data Mining Knowl Discov 2017, 7:e1218. doi: 10.1002/widm.1218
This article is categorized under:
Algorithmic Development > Web Mining
Fundamental Concepts of Data and Knowledge > Information Repositories
Fundamental Concepts of Data and Knowledge > Motivation and Emergence of Data Mining
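The abstract above describes a Web crawler as a program that retrieves webpages by following hyperlinks. A minimal sketch of that idea, assuming a breadth-first visiting order, is given here. The names `crawl`, `LinkExtractor`, `fetch`, and the in-memory `PAGES` dictionary are hypothetical and chosen only for illustration; they are not taken from the reviewed studies, and a real crawler would also need politeness delays, robots.txt handling, and error recovery.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`.

    `fetch(url)` is an assumed callable returning the page HTML,
    or None when the page cannot be retrieved.
    """
    seen = {seed}              # URLs already queued, to avoid revisits
    frontier = deque([seed])   # FIFO queue gives breadth-first order
    order = []                 # pages fetched successfully, in visit order
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order


# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.test/b": '<a href="/">home</a>',
    "http://example.test/c": "",
}

visited = crawl("http://example.test/", PAGES.get)
```

Passing `fetch` as a parameter keeps the traversal logic separate from the network layer, so the same loop could be driven by an HTTP client instead of the dictionary lookup used here.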
Regression test selection is the process of selecting a subset of existing test cases, which are then used together with some new test cases for regression testing. Regression testing ensures that changes made to the system have not affected the existing functionality. To date, there is no adequate technique that performs regression test selection from UML diagrams while considering changes in the semantics of operations (along with other syntactic and semantic changes). A change in the semantics of an operation refers to a change in conditional statements, in independent (unique) paths, or in control flow, or to the addition or deletion of any content from the existing functionality. In this study, a novel approach is presented that does this using class, sequence and activity diagrams. The tool compares old and new versions of UML diagrams to categorise test cases into reusable, retestable, obsolete and newly generated categories. Activity diagrams are specifically used to test the semantics of operations. The changed operations corresponding to these activity diagrams are also searched for in class and sequence diagrams for regression test selection. This study has been validated by comparison with a previous study. It is found that the authors' work provides a significant increase in accuracy.
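The categorisation step described above can be illustrated with a small sketch. It is not the authors' tool: instead of UML diagrams, it assumes each operation is represented by a hypothetical fingerprint string (standing in for, say, a digest of its activity diagram) and that each test case lists the operations it exercises. The function name `categorise` and all data below are invented for illustration.

```python
def categorise(old_ops, new_ops, tests):
    """Sort test cases into the four categories named in the abstract.

    old_ops / new_ops: dict mapping operation name -> fingerprint
        (an assumed stand-in for comparing two diagram versions).
    tests: dict mapping test name -> set of operation names it exercises.
    """
    deleted = old_ops.keys() - new_ops.keys()
    added = new_ops.keys() - old_ops.keys()
    changed = {
        op for op in old_ops.keys() & new_ops.keys()
        if old_ops[op] != new_ops[op]
    }

    result = {"reusable": [], "retestable": [], "obsolete": [], "new": []}
    for name, covered in tests.items():
        if covered & deleted:
            # Exercises an operation that no longer exists.
            result["obsolete"].append(name)
        elif covered & changed:
            # Exercises an operation whose behaviour changed: rerun it.
            result["retestable"].append(name)
        else:
            # Touches only unchanged operations.
            result["reusable"].append(name)
    # Added operations have no tests yet; new ones must be generated.
    result["new"] = sorted(f"test_{op}" for op in added)
    return result


# Hypothetical example data.
old_ops = {"login": "v1", "checkout": "v1", "export": "v1"}
new_ops = {"login": "v1", "checkout": "v2", "refund": "v1"}
tests = {
    "t_login": {"login"},
    "t_checkout": {"login", "checkout"},
    "t_export": {"export"},
}

selection = categorise(old_ops, new_ops, tests)
```

In this example `checkout` has a different fingerprint in the new version, `export` was removed, and `refund` was added, so each of the four categories receives exactly one entry.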