2008
DOI: 10.1007/978-3-540-89524-4_54

Efficient Partitioning Strategies for Distributed Web Crawling

Abstract: This paper presents a multi-objective approach to Web space partitioning, aimed at improving distributed crawling efficiency. The investigation is supported by the construction of two different weighted graphs. The first is used to model the topological communication infrastructure between crawlers and Web servers, and the second is used to represent the amount of link connections between servers' pages. The values of the graph edges represent, respectively, computed RTTs and page links between nodes. The two gr…
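To make the abstract's setup concrete, here is a minimal sketch of the two weighted graphs it describes, using networkx. All node names and edge values are hypothetical, not taken from the paper.

```python
# A minimal sketch (not from the paper) of the two weighted graphs the
# abstract describes. All node names and edge values are hypothetical.
import networkx as nx

# Graph 1: topological communication infrastructure; each edge weight is
# a computed RTT (in ms) between a crawler and a Web server.
comm = nx.Graph()
comm.add_edge("crawler-1", "server-a", weight=35.0)
comm.add_edge("crawler-1", "server-b", weight=120.0)
comm.add_edge("crawler-2", "server-a", weight=90.0)
comm.add_edge("crawler-2", "server-b", weight=25.0)

# Graph 2: link structure; each edge weight is the number of page links
# between two servers' pages.
links = nx.Graph()
links.add_edge("server-a", "server-b", weight=42)
links.add_edge("server-a", "server-c", weight=7)
links.add_edge("server-b", "server-c", weight=13)
```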

Cited by 10 publications (5 citation statements)
References 7 publications
“…Another aspect that has drawn the attention of researchers is the efficient partitioning mechanisms of the Web space. Work done by Exposto et al. [39] has presented a multi-objective approach for partitioning the Web space by modeling the Web hosts and IP hosts as graphs. These graphs are partitioned, and a new graph is created with the weights calculated using the original weights and the edge-cuts.…”
Section: Web Crawlers and Crawling Techniques (mentioning)
confidence: 99%
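The partition-and-recombine step this statement summarizes can be sketched as follows. The bisection routine and the weight-combination rule below are assumptions for illustration only (the paper's actual formula is not reproduced here), and the two graphs are assumed to share a node set for simplicity.

```python
# Sketch of the partition-and-recombine step described above: each input
# graph is partitioned, then a new graph is built whose edge weights
# combine the original weights with the observed edge-cuts. The penalty
# rule is illustrative, not the paper's formula.
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

def recombine(g1, g2, alpha=0.5):
    """Merge two weighted graphs into one, penalizing cut edges."""
    merged = nx.Graph()
    for g in (g1, g2):
        # Partition this graph into two parts (stand-in partitioner).
        parts = kernighan_lin_bisection(g, weight="weight")
        side = {n: i for i, part in enumerate(parts) for n in part}
        for u, v, data in g.edges(data=True):
            w = data["weight"]
            if side[u] != side[v]:   # edge-cut: endpoints were split apart
                w += data["weight"]  # illustrative penalty, not the paper's rule
            if merged.has_edge(u, v):
                merged[u][v]["weight"] += alpha * w
            else:
                merged.add_edge(u, v, weight=alpha * w)
    return merged
```

Partitioning the merged graph would then account for both objectives at once.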
“…In the literature, there have been a significant number of design alternatives, including sequential [24], parallel [5,8,20,33,36], and geographically distributed [6,14,15] Web crawlers. The three main quality objectives, common to most crawling architectures, were achieving high collection quality through download scheduling [10,26], maintaining page freshness [7,9,16,30,35], and obtaining high Web coverage [11,23].…”
Section: Previous Work (mentioning)
confidence: 99%
“…If there is a social objective, the placement of crawlers becomes important and should conform as much as possible to the placement of the target content. Spatial locality refers to the geographical placement and closeness of Web sites to crawlers [16,17]. Note that this is different from the country-specific content objective, as belonging to a country does not always guarantee spatial proximity (e.g., very large or disconnected countries) and spatial proximity does not always guarantee belonging to the same country (e.g., sites near the country borders).…”
Section: External Factors (mentioning)
confidence: 99%
“…A distributed crawler has the potential to use network resources more efficiently than a centralized crawler. Consider a distributed crawling system where each crawler is responsible for downloading Web pages that are stored on servers geographically close to itself [16,17]. In such a scenario, the average network latency for downloading Web pages is expected to be smaller than the average latency in a scenario where Web pages are downloaded by a central crawler.…”
Section: Benefits (mentioning)
confidence: 99%
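As a toy illustration of the assignment this statement describes, the sketch below maps each server to the crawler with the lowest measured RTT; all names and RTT values are hypothetical.

```python
# Toy illustration (hypothetical data): assign each server to the crawler
# with the lowest measured RTT to it, approximating geographic closeness.
rtt_ms = {
    ("crawler-1", "server-a"): 35.0,
    ("crawler-2", "server-a"): 90.0,
    ("crawler-1", "server-b"): 120.0,
    ("crawler-2", "server-b"): 25.0,
}

def assign_servers(rtt):
    """Map each server to its nearest crawler by RTT."""
    best = {}
    for (crawler, server), ms in rtt.items():
        if server not in best or ms < best[server][1]:
            best[server] = (crawler, ms)
    return {server: crawler for server, (crawler, _) in best.items()}

print(assign_servers(rtt_ms))  # {'server-a': 'crawler-1', 'server-b': 'crawler-2'}
```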