Proceedings of the 2005 Workshop on Geographic Information Retrieval
DOI: 10.1145/1096985.1096999

Geographical partition for distributed web crawling

Abstract: This paper evaluates scalable distributed crawling by means of the geographical partition of the Web. The approach is based on the existence of multiple distributed crawlers, each responsible for the pages belonging to one or more previously identified geographical zones. The work considers a distributed crawler where the assignment of pages to visit is based on the geographical scope of page content. For the initial assignment of a page to a partition we use a simple heuristic that marks a page within the same sc…
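The assignment heuristic is truncated in the abstract above, but the general idea, routing each page to the crawler responsible for its geographic zone, can be illustrated with a minimal sketch. The zone names, the TLD-to-partition table, and the inherit-from-parent fallback below are illustrative assumptions for the sketch, not the paper's actual heuristic:

from urllib.parse import urlparse

# Hypothetical mapping from country-code TLDs to crawler partitions
# (illustrative zone names; the paper's actual zones are not shown here).
CCTLD_TO_PARTITION = {"pt": "iberia", "es": "iberia", "fr": "france"}
DEFAULT_PARTITION = "global"

def assign_partition(url, parent_partition=None):
    """Return the geographic partition responsible for crawling `url`.

    Assumption: a URL with a country-code TLD carries an explicit
    geographic signal; otherwise the page inherits the scope of the
    page that linked to it, falling back to a default partition.
    """
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    if tld in CCTLD_TO_PARTITION:
        return CCTLD_TO_PARTITION[tld]
    if parent_partition is not None:
        return parent_partition
    return DEFAULT_PARTITION

print(assign_partition("http://example.pt/"))             # iberia (TLD signal)
print(assign_partition("http://example.org/", "iberia"))  # iberia (inherited)
print(assign_partition("http://example.org/"))            # global (fallback)

In a distributed setting, each crawler would presumably enqueue only URLs whose partition matches its own zone and forward the rest to the responsible peer.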


Cited by 26 publications (10 citation statements). References 9 publications.
“…Exposto et al. [28] evaluated scalable distributed crawling by means of the geographical partition of the Web. The approach is based on the existence of multiple distributed crawlers, each responsible for the pages belonging to one or more previously identified geographical zones.…”
Section: Literature Review
confidence: 99%
“…In the literature, there have been a significant number of design alternatives, including sequential [24], parallel [5,8,20,33,36], and geographically distributed [6,14,15] Web crawlers. The three main quality objectives, common to most crawling architectures, were achieving high collection quality through download scheduling [10,26], maintaining page freshness [7,9,16,30,35], and obtaining high Web coverage [11,23].…”
Section: Previous Work
confidence: 99%
“…Exposto et al try to find the optimal locations for several Web crawlers considering the data volume and the time spent crawling [17]. Li et al study the feasibility of P2P Web search engines in terms of network bandwidth and storage space on the peers [24], and conclude that Web search using P2P technology still requires an order of magnitude more resources than available, despite a range of considered performance optimizations.…”
Section: Related Work
confidence: 99%