A Framework for Incremental Domain-Specific Hidden Web Crawler

Madaan, Rosy; Dixit, Ashutosh; Sharma, Ashok Kumar; Bhatia, Komal Kumar

doi:10.1007/978-3-642-14834-7_39

Cited by 9 publications

(7 citation statements)

References 5 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…It has three main modules: URL wrapper, OAI handler, and XLST processor. Madaan et al devised an incremental hidden Web crawler for domain‐specific Web. Proposed architecture has following modules: domain‐specific hidden web crawler (DSHWC), URL extractor, revisits frequency calculator, update module, and dispatcher.…”

Section: Current Status Of Web Crawlermentioning

confidence: 99%

A survey of Web crawlers for information retrieval

Kumar

Bhatia

Rattan

2017

WIREs Data Min & Knowl

View full text Add to dashboard Cite

Performance of any search engine relies heavily on its Web crawler. Web crawlers are the programs that get webpages from the Web by following hyperlinks. These webpages are indexed by a search engine and can be retrieved by a user query. In the area of Web crawling, we still lack an exhaustive study that covers all crawling techniques. This study follows the guidelines of systematic literature review and applies it to the field of Web crawling. We used the standard procedure of carrying out a systematic literature review on 248 studies from a total of 1488 articles published in 12 leading journals and other premier conferences and workshops. Existing literature about the Web crawler is classified into different key subareas. Each subarea is further divided according to the techniques being used. We analyzed the distribution of various articles using multiple criteria and depicted conclusions. Various studies that use open source Web crawlers are also reported. We have highlighted future areas of research. We call for an increased awareness in various fields of the Web crawler and identify how techniques from other domains can be used for crawling the Web. Limitations and recommendations for future are also discussed. WIREs Data Mining Knowl Discov 2017, 7:e1218. doi: 10.1002/widm.1218 This article is categorized under: Algorithmic Development > Web Mining Fundamental Concepts of Data and Knowledge > Information Repositories Fundamental Concepts of Data and Knowledge > Motivation and Emergence of Data Mining

show abstract

Section: Current Status Of Web Crawlermentioning

confidence: 99%

A survey of Web crawlers for information retrieval

Kumar

Bhatia

Rattan

2017

WIREs Data Min & Knowl

View full text Add to dashboard Cite

show abstract

“…It is operated by collecting data and by maintaining sessions between servers to facilitate script execution by clients. Madaan et al (2010) produced an incremental web crawler in order to store outcomes immediately reflected by the information change provided by the deep web. This method determines the cycle of crawler visits in terms of probability, calculates the changing cycle, and applies the optimal value for a revisit.…”

Section: Methodsmentioning

confidence: 99%

Design and implementation of crawling algorithm to collect deep web information for web archiving

Won²,

Kim

et al. 2018

DTA

View full text Add to dashboard Cite

Purpose The purpose of this paper is to describe the development of an algorithm for realizing web crawlers that automatically collect dynamically generated webpages from the deep web. Design/methodology/approach This study proposes and develops an algorithm to collect web information as if the web crawler gathers static webpages by managing script commands as links. The proposed web crawler actually experiments with the algorithm by collecting deep webpages. Findings Among the findings of this study is that if the actual crawling process provides search results as script pages, the outcome only collects the first page. However, the proposed algorithm can collect deep webpages in this case. Research limitations/implications To use a script as a link, a human must first analyze the web document. This study uses the web browser object provided by Microsoft Visual Studio as a script launcher, so it cannot collect deep webpages if the web browser object cannot launch the script, or if the web document contains script errors. Practical implications The research results show deep webs are estimated to have 450 to 550 times more information than surface webpages, and it is difficult to collect web documents. However, this algorithm helps to enable deep web collection through script runs. Originality/value This study presents a new method to be utilized with script links instead of adopting previous keywords. The proposed algorithm is available as an ordinary URL. From the conducted experiment, analysis of scripts on individual websites is needed to employ them as links.

show abstract

“…Furthermore, Singhal et al (2010) proposed a new approach to regulate the revisiting frequency, a new mechanism and architecture for the incremental crawler. Madaan et al (2010) also proposed a new architecture to continuously update the hidden web depositary. Moreover, others focussed on parallel crawler processing by combining augmentations to hypertext documents (Sharma et al, 2003a(Sharma et al, , b, 2010.…”

Section: Content Type Diversity and Crawling Issuesmentioning

confidence: 99%

Search engines crawling process optimization: a webserver approach

Zineddine

2016

Internet Research

View full text Add to dashboard Cite

Purpose – The purpose of this paper is to decrease the traffic created by search engines’ crawlers and solve the deep web problem using an innovative approach. Design/methodology/approach – A new algorithm was formulated based on best existing algorithms to optimize the existing traffic caused by web crawlers, which is approximately 40 percent of all networking traffic. The crux of this approach is that web servers monitor and log changes and communicate them as an XML file to search engines. The XML file includes the information necessary to generate refreshed pages from existing ones and reference new pages that need to be crawled. Furthermore, the XML file is compressed to decrease its size to the minimum required. Findings – The results of this study have shown that the traffic caused by search engines’ crawlers might be reduced on average by 84 percent when it comes to text content. However, binary content faces many challenges and new algorithms have to be developed to overcome these issues. The proposed approach will certainly mitigate the deep web issue. The XML files for each domain used by search engines might be used by web browsers to refresh their cache and therefore help reduce the traffic generated by normal users. This reduces users’ perceived latency and improves response time to http requests. Research limitations/implications – The study sheds light on the deficiencies and weaknesses of the algorithms monitoring changes and generating binary files. However, a substantial decrease of traffic is achieved for text-based web content. Practical implications – The findings of this research can be adopted by web server software and browsers’ developers and search engine companies to reduce the internet traffic caused by crawlers and cut costs. Originality/value – The exponential growth of web content and other internet-based services such as cloud computing, and social networks has been causing contention on available bandwidth of the internet network. This research provides a much needed approach to keeping traffic in check.

show abstract

A Framework for Incremental Domain-Specific Hidden Web Crawler

Cited by 9 publications

References 5 publications

A survey of Web crawlers for information retrieval

A survey of Web crawlers for information retrieval

Design and implementation of crawling algorithm to collect deep web information for web archiving

Search engines crawling process optimization: a webserver approach

Contact Info

Product

Resources

About