2014
DOI: 10.14778/2735496.2735506

Fast failure recovery in distributed graph processing systems

Abstract: Distributed graph processing systems increasingly require many compute nodes to cope with the requirements imposed by contemporary graph-based Big Data applications. However, increasing the number of compute nodes increases the chance of node failures. Therefore, provisioning an efficient failure recovery strategy is critical for distributed graph processing systems. This paper proposes a novel recovery mechanism for distributed graph processing systems that parallelizes the recovery process. The key idea is t…

Cited by 36 publications (24 citation statements)
References 25 publications
“…Single Failure: Instead of utilizing a new available node to substitute the failed node in [1,19], the healthy nodes are responsible for confined recovery in our approach once a single node failure is detected. Hence, the confined recovery process is applied in parallel on all the remaining healthy nodes.…”
Section: B Parallel Confined Recovery (mentioning)
confidence: 99%
“…However, this new node affords all the recomputation workload. A partition-based reassignment algorithm [19] is proposed to accelerate the recovery process through simultaneous reduction of recovery communication costs and parallelization of the recovery computations. However, it relies on the statistics of the graph partition state kept by additional resources or components.…”
Section: B Failure Recovery (mentioning)
confidence: 99%
“…The simplest approach is to restart the entire computation. The most well-known alternatives include data materialization at synchronization boundaries [14,18,47], checkpointing the state of the entire computation either synchronously [33,36] or asynchronously [31], and restarting using lineage tracking and periodic checkpoints [40,49].…”
Section: Failure Handling (mentioning)
confidence: 99%
“…Considering that there could be as many as 150,000 candidate n-grams per input (sentence) [5], the communication cost becomes prohibitively expensive. Another challenge of implementing distributed n-gram models is related to the network failure, which happens quite often in a distributed system with a large amount of network communication [23]. If some n-gram messages get lost, the model would produce inaccurate estimation of the probability.…”
Section: Introduction (mentioning)
confidence: 99%