With the rapid development of the Internet, general-purpose web crawlers have become increasingly unable to meet users' individual needs, as they cannot fetch deep web pages efficiently. The large number of deep web pages on websites and the widespread use of Ajax make it difficult for general-purpose web crawlers to gather information quickly and efficiently. Building on the original Robots Exclusion Protocol (REP), this paper proposes a Robots Exclusion and Guidance Protocol (REGP) that integrates the scattered, independent extensions of the original Robots Protocol developed by major search engine companies. Our protocol expands the file format and command set of the REP, as well as two labels of the Sitemap Protocol. Through our protocol, websites can express their restriction and guidance requirements to visiting crawlers, provide crawlers with fast, general-purpose access to deep web pages and Ajax pages, and allow crawlers to obtain a website's open data easily and effectively. Finally, this paper presents a specific application scenario in which both a website and a crawler operate with support from our protocol. A series of experiments is also conducted to demonstrate the efficiency of the proposed protocol.
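To make the idea concrete, the following minimal Python sketch shows how a crawler might separate classic REP commands from extended guidance directives in a robots.txt-style file. The directive names Deep-Entry and Ajax-State are purely hypothetical illustrations; the actual REGP command set is defined in the paper itself, not in this summary.

    # Hypothetical sketch: splitting a robots.txt-like file into standard
    # REP rules and REGP-style guidance directives. Directive names below
    # are assumptions, not the protocol's real command set.

    def parse_regp(text):
        rep_rules, guidance = [], []
        for line in text.splitlines():
            line = line.split('#', 1)[0].strip()   # drop comments and blanks
            if not line or ':' not in line:
                continue
            field, value = (part.strip() for part in line.split(':', 1))
            if field.lower() in ('user-agent', 'allow', 'disallow'):
                rep_rules.append((field, value))   # classic REP commands
            else:
                guidance.append((field, value))    # treat the rest as extensions
        return rep_rules, guidance

    sample = """
    User-agent: *
    Disallow: /private/
    Deep-Entry: /search?q={keyword}
    Ajax-State: /app?_escaped_fragment_={fragment}
    """
    print(parse_regp(sample))

A real implementation would also scope guidance directives to the matching User-agent group, as REP does for Allow and Disallow rules.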
With the rapid development of the network, stand-alone crawlers are finding it increasingly hard to discover and gather information, and distributed crawlers are gradually being adopted to solve this problem. This paper proposes a task scheduling strategy for small-scale distributed crawlers based on weighted round robin, in which each node's weight is computed from its crawling efficiency; implements a distributed crawler system with multithreading support and deduplication built around this algorithm; and discusses some possible extensions and details. The design of the error recovery mechanism and the node table gives the crawling nodes flexible scalability and fault tolerance. Finally, we conducted experiments that demonstrate the system's good load-balancing performance.
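As an illustration of the dispatch step, the following Python sketch implements a smooth weighted round robin over crawling nodes. The fixed weights here stand in for the paper's efficiency-based weight formula, which is not given in this summary, and the node names are invented for the example.

    # Minimal sketch of weighted round-robin dispatch for a distributed
    # crawler. Weights would in practice come from measured crawling
    # efficiency; here they are fixed illustrative values.

    class Node:
        def __init__(self, name, weight):
            self.name = name
            self.weight = weight    # weight derived from crawl efficiency
            self.current = 0        # running counter for smooth WRR

    def pick_node(nodes):
        """Each call returns the next node, distributing picks
        in proportion to the node weights."""
        total = sum(n.weight for n in nodes)
        for n in nodes:
            n.current += n.weight
        best = max(nodes, key=lambda n: n.current)
        best.current -= total
        return best

    nodes = [Node("crawler-a", 5), Node("crawler-b", 3), Node("crawler-c", 1)]
    for i in range(9):
        print(pick_node(nodes).name, "<-", f"http://example.com/page{i}")

Over nine dispatches with weights 5, 3, and 1, the nodes receive five, three, and one URL respectively, which is the load-balancing behavior the scheduling strategy targets.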