Abstract: One of the most challenging tasks in information retrieval is crawling and indexing the deep web. The deep web is the part of the World Wide Web that is not publicly visible and therefore cannot be indexed by conventional crawlers. It holds a large amount of scholarly data, images, and videos which, if indexed, could support research and help curb illegal activities. We propose an efficient hidden web crawler that uses Sampling and Associativity Rules to find the most important and relevant keywords, which are then used to generate queries that extract information from databases and web forms. Further, we use the random forest technique to index the search results. Our web crawler efficiently overcomes several prior challenges that we describe in this paper.
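As an illustration of the keyword-selection idea, the following minimal sketch ranks terms from a sample of documents by the co-occurrence rules they take part in. It assumes that "Associativity Rules" refers to standard association rule mining with support and confidence thresholds; the function name `rank_keywords`, the thresholds, and the scoring scheme are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def rank_keywords(sampled_docs, min_support=0.5, min_confidence=0.6):
    """Rank terms by the number of strong pairwise rules they appear in.

    Hypothetical helper: the thresholds and scoring are illustrative
    assumptions, not the method described in the paper.
    """
    n = len(sampled_docs)
    # Treat each sampled document as a set of lowercase terms.
    term_sets = [set(doc.lower().split()) for doc in sampled_docs]

    # Count how many documents contain each term and each term pair.
    item_counts = Counter(t for terms in term_sets for t in terms)
    pair_counts = Counter(
        pair for terms in term_sets for pair in combinations(sorted(terms), 2)
    )

    scores = Counter()
    for (a, b), count in pair_counts.items():
        # Support: fraction of sampled documents containing both terms.
        if count / n < min_support:
            continue
        # Check both rule directions, a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                scores[lhs] += 1
                scores[rhs] += 1

    # Highest score first; ties broken alphabetically for stable output.
    return [t for t, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]
```

The top-ranked terms could then serve as candidate query keywords for a web form, while terms that appear only in isolated documents are dropped by the support threshold.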