Using Replication for Resilience on Exascale Systems

Casanova, Henri; Vivien, Frédéric; Zaidouni, Dounia

doi:10.1007/978-3-319-20943-2_4

Cited by 11 publications

(11 citation statements)

References 33 publications


“…We use the term core to represent the computing resource allocation unit [28]. We further use P(σ, w, t) to denote a process executing at rate σ to complete a workload w by time t. The basic tenet of Lazy Shadowing is the concept of shadowing, whereby each process is associated with a lazy replica.…”

Section: Lazy Shadowing (mentioning)

confidence: 99%

“…The impact of process replication on MNFTI has been studied in [28]. Our problem is equivalent to that, with the difference that our work can tolerate one failure in each shadowed set while [28] can tolerate one failure in each replica-group of size 2.…”

Section: Application Failure Probability (mentioning)

confidence: 99%

“…Our problem is equivalent to that, with the difference that our work can tolerate one failure in each shadowed set while [28] can tolerate one failure in each replica-group of size 2. Therefore, we can apply the methodology in [28] to our case, and the MNFTI with Lazy Shadowing for different number of shadowed sets (S) is shown in Table 1. Note that when processes are not replicated, every failure would interrupt the application, i.e., MNFTI=1, so MNFTI can be significantly increased by Lazy Shadowing.…”

Section: Application Failure Probability (mentioning)

confidence: 99%
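The MNFTI (mean number of failures to interruption) mentioned in the statements above can be illustrated for the case of replica groups of size 2. The following is a minimal sketch, not the methodology of [28] or of the Lazy Shadowing paper: it assumes that each failure strikes one still-running processor chosen uniformly at random, and that the application is interrupted as soon as both replicas of some group have failed. The function name `mnfti_pairs` and the parameter `n_groups` are hypothetical, and the sketch does not reproduce Table 1.

```python
from fractions import Fraction


def mnfti_pairs(n_groups: int) -> Fraction:
    """Expected number of failures until some replica pair has lost both members.

    Assumption (not taken from the source): each failure strikes one
    still-running processor, chosen uniformly at random.
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    total = 2 * n_groups
    # State k = number of pairs that have already lost exactly one replica.
    # With k = n_groups every survivor is the last replica of its pair, so the
    # next failure always interrupts the application.
    expected = Fraction(1)
    for k in range(n_groups - 1, -1, -1):
        alive = total - k          # processors still running in state k
        healthy = total - 2 * k    # processors whose partner is also running
        # One failure occurs; with probability healthy/alive it degrades a
        # healthy pair and the count continues from state k + 1.
        expected = 1 + Fraction(healthy, alive) * expected
    return expected


if __name__ == "__main__":
    for n in (1, 2, 8, 64):
        print(n, float(mnfti_pairs(n)))
```

Under this assumption the value grows with the number of replica pairs, whereas without replication every failure interrupts the application (MNFTI = 1), which is the contrast the citing statement draws.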


Adaptive and Power-Aware Resilience for Extreme-scale Computing

Cui, Znati, Melhem. 2017. FSP

With the concerted efforts from researchers in hardware, software, algorithm, and data management, HPC is moving towards extreme-scale, featuring a computing capability of quintillion (10^18) FLOPS. As we approach the new era of computing, however, several daunting scalability challenges remain to be conquered. Delivering extreme-scale performance will require a computing platform that supports billion-way parallelism, necessitating a dramatic increase in the number of computing, storage, and networking components. At such a large scale, failure would become a norm rather than an exception, driving the system to significantly lower efficiency with an unprecedented amount of power consumption. To tackle this challenge, we propose an adaptive and power-aware algorithm, referred to as Lazy Shadowing, as an efficient and scalable approach to achieve high levels of resilience, through forward progress, in extreme-scale, failure-prone computing environments. Lazy Shadowing associates with each process a "shadow" (process) that executes at a reduced rate, and opportunistically rolls forward each shadow to catch up with its leading process during failure recovery. Compared to existing fault tolerance methods, our approach can achieve 20% energy saving with potential reduction in solution time at scale.


“…Replication remains the most transparent and least intrusive technique and can be used at different levels (duplication, triplication or even more). Combined with checkpointing, replication comes with two flavors: process replication [24,25] and group replication [26]. Process replication applies to message-passing applications with communicating processes.…”

Section: Related Work (mentioning)

confidence: 99%

Resilient N-Body Tree Computations with Algorithm-Based Focused Recovery: Model and Performance Analysis

Cavelan, Fang, Chien, et al. 2017. Lecture Notes in Computer Science

This paper presents a model and performance study for Algorithm-Based Focused Recovery (ABFR) applied to N-body computations, subject to latent errors. We make a detailed comparison with the classical Checkpoint/Restart (CR) approach. While the model applies to general frameworks, the performance study is limited to perfect binary trees, due to the inherent difficulty of the analysis. With ABFR, the crucial parameter is the detection interval, which bounds the error latency. We show that the detection interval has a dramatic impact on the overhead, and that optimally choosing its value leads to significant gains over the CR approach.


“…When transparent replication is not (yet) provided by the runtime system, one solution could be to implement it explicitly within the application, but this is a labor-intensive process especially for legacy applications. Another approach introduced in [15] is group replication, a technique that can be used whenever process replication is not available. Group replication is agnostic to the parallel programming model, and thus views the application as an unmodified black box.…”

Section: Introduction (mentioning)

confidence: 99%