With the rapid development of the Internet, general-purpose web crawlers have become increasingly unable to meet users' individual needs, as they cannot fetch deep web pages efficiently. The large number of deep web pages on websites and the widespread use of Ajax make it difficult for general-purpose web crawlers to gather information quickly and efficiently. Building on the original Robots Exclusion Protocol (REP), this paper proposes a Robots Exclusion and Guidance Protocol (REGP) that integrates the scattered, independent extensions of the original Robots Protocol developed by major search engine companies. Our protocol expands the file format and command set of the REP, as well as two labels of the Sitemap Protocol. Through our protocol, websites can express their restriction and guidance requirements to visiting crawlers, provide crawlers with fast, general-purpose access to deep web pages and Ajax pages, and allow crawlers to obtain a website's open data easily and effectively. Finally, this paper presents a specific application scenario in which both a website and a crawler operate with support from our protocol. A series of experiments is also conducted to demonstrate the efficiency of the proposed protocol.
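To make the idea concrete, the following minimal Python sketch shows how a crawler might separate classic REP commands from extended guidance directives in a robots.txt-style file. The directive names Deep-Entry and Ajax-State are purely hypothetical illustrations; the actual REGP command set is defined in the paper itself, not in this summary.

    # Hypothetical sketch: splitting a robots.txt-like file into standard
    # REP rules and REGP-style guidance directives. Directive names below
    # are assumptions, not the protocol's real command set.

    def parse_regp(text):
        rep_rules, guidance = [], []
        for line in text.splitlines():
            line = line.split('#', 1)[0].strip()   # drop comments and blanks
            if not line or ':' not in line:
                continue
            field, value = (part.strip() for part in line.split(':', 1))
            if field.lower() in ('user-agent', 'allow', 'disallow'):
                rep_rules.append((field, value))   # classic REP commands
            else:
                guidance.append((field, value))    # treat the rest as extensions
        return rep_rules, guidance

    sample = """
    User-agent: *
    Disallow: /private/
    Deep-Entry: /search?q={keyword}
    Ajax-State: /app?_escaped_fragment_={fragment}
    """
    print(parse_regp(sample))

A real implementation would also scope guidance directives to the matching User-agent group, as REP does for Allow and Disallow rules.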
With the rapid development of the network, stand-alone crawlers are finding it increasingly hard to discover and gather information, and distributed crawlers are gradually being adopted to solve this problem. This paper proposes a task scheduling strategy for small-scale distributed crawlers based on weighted round robin, in which each node's weight is computed from its crawling efficiency; implements a distributed crawler system with multithreading support and deduplication built around this algorithm; and discusses some possible extensions and details. The design of the error recovery mechanism and the node table gives the crawling nodes flexible scalability and fault tolerance. Finally, we conducted experiments that demonstrate the system's good load-balancing performance.
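As an illustration of the dispatch step, the following Python sketch implements a smooth weighted round robin over crawling nodes. The fixed weights here stand in for the paper's efficiency-based weight formula, which is not given in this summary, and the node names are invented for the example.

    # Minimal sketch of weighted round-robin dispatch for a distributed
    # crawler. Weights would in practice come from measured crawling
    # efficiency; here they are fixed illustrative values.

    class Node:
        def __init__(self, name, weight):
            self.name = name
            self.weight = weight    # weight derived from crawl efficiency
            self.current = 0        # running counter for smooth WRR

    def pick_node(nodes):
        """Each call returns the next node, distributing picks
        in proportion to the node weights."""
        total = sum(n.weight for n in nodes)
        for n in nodes:
            n.current += n.weight
        best = max(nodes, key=lambda n: n.current)
        best.current -= total
        return best

    nodes = [Node("crawler-a", 5), Node("crawler-b", 3), Node("crawler-c", 1)]
    for i in range(9):
        print(pick_node(nodes).name, "<-", f"http://example.com/page{i}")

Over nine dispatches with weights 5, 3, and 1, the nodes receive five, three, and one URL respectively, which is the load-balancing behavior the scheduling strategy targets.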