Architecture of a grid-enabled Web search engine

Cambazoğlu, B. Barla; Karaca, Evren; Küçükyılmaz, Tayfun; Türk, Ata; Aykanat, Cevdet

doi:10.1016/j.ipm.2006.10.011

Cited by 25 publications

(8 citation statements)

References 36 publications

“…There are many research studies in web crawling, such as URL ordering for retrieving high-quality pages earlier [16], partitioning the web for efficient multi-processor crawling [17], distributed crawling [18], and focused crawling [19]. There are some crawling architectures that designed based grid computing that aims to increase the performance in specific issues in web crawling, such as a grid focused community crawling architecture for medical information retrieval services [20], Multi Agent System-based crawlers for Virtual Organizations [21], a dynamic URL assignment method for parallel web crawler [22], A Middleware for Deep Web Crawling Using the Grid [23], and Architecture of a grid-enabled Web search engine [24].…”

Section: Related Work (mentioning)

confidence: 99%

Crawler Architecture Using Grid Computing

Elaraby

2012

IJCSIT

Crawler is one of the main components in the search engines which use URLs to fetch web pages to build a repository of web pages starting with entering URL. Each web page is parsed to extract the URLs included in it and store the extracted URLs in the URLs Queue to fetch by the crawlers in sequential. The process of crawling takes long time to collect more web pages, and it has become necessary to utilize the unused computing resources and cost/time savings in organizations. This paper deals with the crawler of search engine using grid computing. This paper presents the grid computing that has been implemented by Alchemi. Alchemi is an open source project developed at the University of Melbourne, provides middleware for creating an enterprise grid computing environment. The crawling processes are passed to Alchemi manager which distribute the processes over a number of computers as executors. The search engine crawler with the grid computing is implemented, tested and the results are analyzed. There is an increase in performance and less time over the single computer.
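The sequential crawl loop this abstract describes (fetch a page, parse it for URLs, append them to a URL queue) can be sketched roughly as follows. This is a minimal illustration, not the paper's Alchemi-based implementation: the names `crawl`, `LinkExtractor`, and the injected `fetch` callable are assumptions made here, and `fetch` is left to the caller so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl starting at seed_url.

    fetch(url) is a caller-supplied function (hypothetical here) that
    returns the page HTML as a string, or None if the page is unavailable.
    Returns a dict mapping each fetched URL to its HTML.
    """
    queue = deque([seed_url])   # the URL queue
    seen = {seed_url}           # URLs already queued, to avoid refetching
    repository = {}             # the page repository being built
    while queue and len(repository) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        repository[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before queuing.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return repository
```

In a grid setting such as the one the abstract describes, the body of the loop (fetch plus link extraction) is the unit of work that would be handed to executors, while the queue and the seen set stay with the manager; that split is an assumption about the design, not a detail given in the abstract.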

“…However, consistent hashing prevents the system from optimizations on network distance. In IPMicra [8], [9], crawler is selected to crawl a certain Web site if they were located within the same AS or ISP network according to the information provided by Regional Internet Registries (RIRs); SE4SEE [10] reduces the network distance by assigning crawler Web sites that were located within the crawler's country. The metrics of network distance the above two systems adopt (AS differences and geographical distances) cannot fully reveal the Web hosts' true positions on the Internet, because the routers' routing strategies usually don't comply with the restrictions of ASs or cities.…”

Section: Related Work (mentioning)

confidence: 99%

Exploring Web Partition in DHT-Based Distributed Web Crawling

Xiao, Zhang, et al.

2010

IEICE Trans. Inf. & Syst.

SUMMARY: The basic requirements of the distributed Web crawling systems are: short download time, low communication overhead and balanced load which largely depends on the systems' Web partition strategies. In this paper, we propose a DHT-based distributed Web crawling system and several DHT-based Web partition methods. First, a new system model based on a DHT method called the Content Addressable Network (CAN) is proposed. Second, based on this model, a network-distance-based Web partition is implemented to reduce the crawler-crawlee network distance in a fully distributed manner. Third, by utilizing the locality on the link space, we propose the concept of link-based Web partition to reduce the communication overhead of the system. This method not only reduces the number of inter-links to be exchanged among the crawlers but also reduces the cost of routing on the DHT overlay. In order to combine the benefits of the above two Web partition methods, we then propose 2 distributed multi-objective Web partition methods. Finally, all the methods we propose in this paper are compared with existing system models in the simulated experiments under different datasets and different system scales. In most cases, the new methods show their superiority.
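The citation statement for this entry observes that consistent hashing prevents optimizations on network distance. A minimal sketch of plain consistent hashing of Web hosts onto crawlers may make that limitation concrete. It is not the CAN-based partition the paper proposes; the class name `HashRing`, the `replicas` parameter, and the use of SHA-1 are assumptions chosen for illustration.

```python
import bisect
import hashlib


def _point(key):
    """Map a string to a position on the hash ring (64-bit integer)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Consistent-hash partition of Web hosts over a set of crawlers."""

    def __init__(self, crawlers, replicas=64):
        # Each crawler owns several virtual points to even out the load.
        self._ring = sorted(
            (_point(f"{crawler}#{i}"), crawler)
            for crawler in crawlers
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def crawler_for(self, host):
        """Return the crawler owning the first ring point after the host."""
        index = bisect.bisect_right(self._points, _point(host))
        return self._ring[index % len(self._ring)][1]
```

The assignment depends only on the hash of the host name, so a host may land on a crawler that is far away in network terms. The benefit is stability: when a crawler joins, only the hosts it takes over change owner.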

“…However, it prevents the system from optimizations on network distance. In IPMicra [8], crawler is selected to crawl a certain Web site if they were located in the same AS or ISP network according to the information provided by Regional Internet Registries (RIRs); SE4SEE [27] reduces the network distance by assigning crawler Web sites that were located within the crawler's country; Apoidea [8] implements a Chord-based DWC system. However, it didn't make optimizations in reducing the crawler-host distance.…”

Section: Related Work (mentioning)

confidence: 99%

Efficient Distributed Web Crawling Utilizing Internet Resources

Xiao, Zhang, et al.

2010

IEICE Trans. Inf. & Syst.

SUMMARY: Internet computing is proposed to exploit personal computing resources across the Internet in order to build large-scale Web applications at lower cost. In this paper, a DHT-based distributed Web crawling model based on the concept of Internet computing is proposed. Also, we propose two optimizations to reduce the download time and waiting time of the Web crawling tasks in order to increase the system's throughput and update rate. Based on our contributor-friendly download scheme, the improvement on the download time is achieved by shortening the crawler-crawlee RTTs. In order to accurately estimate the RTTs, a network coordinate system is combined with the underlying DHT. The improvement on the waiting time is achieved by redirecting the incoming crawling tasks to light-loaded crawlers in order to keep the queue on each crawler equally sized. We also propose a simple Web site partition method to split a large Web site into smaller pieces in order to reduce the task granularity. All the methods proposed are evaluated through real Internet tests and simulations showing satisfactory results.
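The idea of redirecting incoming tasks to lightly loaded crawlers so that queues stay equally sized can be sketched as a least-loaded assignment. This is a centralized simplification under stated assumptions: the abstract does not say how redirection works over the DHT, and the function name `assign_tasks` and the unit cost per task are hypothetical.

```python
import heapq


def assign_tasks(tasks, crawlers):
    """Give each incoming task to the crawler with the shortest queue.

    Returns a dict mapping crawler name to its list of tasks. Every task
    is assumed to cost one unit of queue length.
    """
    # Min-heap of (current queue length, crawler name).
    heap = [(0, name) for name in crawlers]
    heapq.heapify(heap)
    queues = {name: [] for name in crawlers}
    for task in tasks:
        load, name = heapq.heappop(heap)      # lightest-loaded crawler
        queues[name].append(task)
        heapq.heappush(heap, (load + 1, name))
    return queues
```

With unit-cost tasks, queue lengths never differ by more than one. Splitting a large Web site into smaller pieces, as the abstract also proposes, brings real tasks closer to that uniform-cost assumption.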
