2008
DOI: 10.1007/978-3-540-89524-4_54

Efficient Partitioning Strategies for Distributed Web Crawling

Abstract: This paper presents a multi-objective approach to Web space partitioning, aimed at improving distributed crawling efficiency. The investigation is supported by the construction of two different weighted graphs. The first is used to model the topological communication infrastructure between crawlers and Web servers, and the second is used to represent the amount of link connections between servers' pages. The values of the graph edges represent, respectively, computed RTTs and page links between nodes. The two gr…
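To make the abstract's setup concrete, here is a minimal sketch of the two weighted graphs it describes, using networkx. All node names and edge values are hypothetical, not taken from the paper.

```python
# A minimal sketch (not from the paper) of the two weighted graphs the
# abstract describes. All node names and edge values are hypothetical.
import networkx as nx

# Graph 1: topological communication infrastructure; each edge weight is
# a computed RTT (in ms) between a crawler and a Web server.
comm = nx.Graph()
comm.add_edge("crawler-1", "server-a", weight=35.0)
comm.add_edge("crawler-1", "server-b", weight=120.0)
comm.add_edge("crawler-2", "server-a", weight=90.0)
comm.add_edge("crawler-2", "server-b", weight=25.0)

# Graph 2: link structure; each edge weight is the number of page links
# between two servers' pages.
links = nx.Graph()
links.add_edge("server-a", "server-b", weight=42)
links.add_edge("server-a", "server-c", weight=7)
links.add_edge("server-b", "server-c", weight=13)
```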

Cited by 10 publications (5 citation statements)
References 7 publications
“…Another aspect that has drawn the attention of researchers is the efficient partitioning mechanisms of the Web space. Work done by Exposto et al. [39] has presented a multi-objective approach for partitioning the Web space by modeling the Web hosts and IP hosts as graphs. These graphs are partitioned, and a new graph is created with the weights calculated using the original weights and the edge-cuts.…”
Section: Web Crawlers and Crawling Techniques (mentioning)
confidence: 99%
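The partition-and-recombine step this statement summarizes can be sketched as follows. The bisection routine and the weight-combination rule below are assumptions for illustration only (the paper's actual formula is not reproduced here), and the two graphs are assumed to share a node set for simplicity.

```python
# Sketch of the partition-and-recombine step described above: each input
# graph is partitioned, then a new graph is built whose edge weights
# combine the original weights with the observed edge-cuts. The penalty
# rule is illustrative, not the paper's formula.
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

def recombine(g1, g2, alpha=0.5):
    """Merge two weighted graphs into one, penalizing cut edges."""
    merged = nx.Graph()
    for g in (g1, g2):
        # Partition this graph into two parts (stand-in partitioner).
        parts = kernighan_lin_bisection(g, weight="weight")
        side = {n: i for i, part in enumerate(parts) for n in part}
        for u, v, data in g.edges(data=True):
            w = data["weight"]
            if side[u] != side[v]:   # edge-cut: endpoints were split apart
                w += data["weight"]  # illustrative penalty, not the paper's rule
            if merged.has_edge(u, v):
                merged[u][v]["weight"] += alpha * w
            else:
                merged.add_edge(u, v, weight=alpha * w)
    return merged
```

Partitioning the merged graph would then account for both objectives at once.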
“…In the literature, there have been a significant number of design alternatives, including sequential [24], parallel [5,8,20,33,36], and geographically distributed [6,14,15] Web crawlers. The three main quality objectives, common to most crawling architectures, were achieving high collection quality through download scheduling [10,26], maintaining page freshness [7,9,16,30,35], and obtaining high Web coverage [11,23].…”
Section: Previous Work (mentioning)
confidence: 99%
“…If there is a social objective, the placement of crawlers becomes important and should conform as much as possible to the placement of the target content. Spatial locality refers to the geographical placement and closeness of Web sites to crawlers [16,17]. Note that this is different from the country-specific content objective, as belonging to a country does not always guarantee spatial proximity (e.g., very large or disconnected countries) and spatial proximity does not always guarantee belonging to the same country (e.g., sites near the country borders).…”
Section: External Factors (mentioning)
confidence: 99%
“…A distributed crawler has the potential to use network resources more efficiently than a centralized crawler. Consider a distributed crawling system where each crawler is responsible for downloading Web pages that are stored on servers geographically close to itself [16,17]. In such a scenario, the average network latency for downloading Web pages is expected to be smaller than the average latency in a scenario where Web pages are downloaded by a central crawler.…”
Section: Benefits (mentioning)
confidence: 99%
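As a toy illustration of the assignment this statement describes, the sketch below maps each server to the crawler with the lowest measured RTT; all names and RTT values are hypothetical.

```python
# Toy illustration (hypothetical data): assign each server to the crawler
# with the lowest measured RTT to it, approximating geographic closeness.
rtt_ms = {
    ("crawler-1", "server-a"): 35.0,
    ("crawler-2", "server-a"): 90.0,
    ("crawler-1", "server-b"): 120.0,
    ("crawler-2", "server-b"): 25.0,
}

def assign_servers(rtt):
    """Map each server to its nearest crawler by RTT."""
    best = {}
    for (crawler, server), ms in rtt.items():
        if server not in best or ms < best[server][1]:
            best[server] = (crawler, ms)
    return {server: crawler for server, (crawler, _) in best.items()}

print(assign_servers(rtt_ms))  # {'server-a': 'crawler-1', 'server-b': 'crawler-2'}
```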