Published: 2013
DOI: 10.5121/ijwsc.2013.4102
Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtualization

Cited by 2 publications (3 citation statements)
References 20 publications
“…The Web search engine, the basic tool for information acquisition, is becoming increasingly important because of the explosion in the size of the Web and users' growing demand for finding information [1][2][4]. Web search engines are information retrieval software systems that help find information stored on the Internet by taking query words as input and retrieving information based on matching criteria. Some search engines mine data available in databases or open directories.…”
Section: Introduction
Mentioning confidence: 99%
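To make "retrieving information based on matching criteria" concrete, here is a minimal sketch of keyword matching against an inverted index. It is not taken from the cited paper; the sample documents and the overlap-count scoring are illustrative assumptions.

```python
# Minimal sketch: rank documents by how many query words they contain.
# The documents and scoring scheme are illustrative assumptions.
from collections import defaultdict

documents = {
    1: "web crawlers download pages for search engines",
    2: "search engines retrieve information from an index",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """Score each document by query-word overlap, highest first."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

print(search("web search"))  # -> [1, 2]: doc 1 matches both words
```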
“…First, it should have a good crawling strategy, i.e., a strategy for deciding which pages to download next. Second, it needs a highly optimized system architecture that is robust against crashes, manageable, and considerate of resources and web servers [1]. The performance of the crawling process has been improved by using a parallel web crawler instead of a batch crawler. However, the existing parallel crawler relies on a single central coordinator, with a high chance of data redundancy: the same URLs are crawled multiple times, which degrades performance [3].…”
Section: Introduction
Mentioning confidence: 99%
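The redundancy problem described above can be illustrated with a minimal sketch of parallel workers sharing one deduplicated frontier, so a URL claimed by one worker is never crawled again by another. This is an assumption-laden toy, not the architecture of the cited papers; the seed URLs and the fetch step are placeholders.

```python
# Minimal sketch: a shared, lock-guarded visited set prevents parallel
# workers from crawling the same URL twice. Seeds and fetch logic are
# placeholder assumptions, not the cited papers' design.
import queue
import threading

frontier = queue.Queue()          # URLs waiting to be crawled
visited = set()                   # URLs already claimed by some worker
visited_lock = threading.Lock()   # guards the shared visited set

def enqueue(url):
    """Add a URL to the frontier only if no worker has claimed it yet."""
    with visited_lock:
        if url not in visited:
            visited.add(url)
            frontier.put(url)

def worker():
    while True:
        try:
            url = frontier.get(timeout=1)
        except queue.Empty:
            return  # frontier drained; worker exits
        # Placeholder for a real HTTP fetch plus link extraction,
        # which would call enqueue() on each discovered link.
        print(f"crawling {url}")
        frontier.task_done()

for seed in ["http://example.com/a", "http://example.com/a",
             "http://example.com/b"]:
    enqueue(seed)  # the duplicate seed is filtered out here

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```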