2018
DOI: 10.1016/j.jpdc.2018.08.002

Coping with silent and fail-stop errors at scale by combining replication and checkpointing

Cited by 13 publications (15 citation statements)
References 49 publications
“…which is the well-known and original Young formula [42]. Variants of Equation (4) have been proposed in the literature, such as $T_{opt} = \sqrt{2(\mu + R)C}$ in [13] [24]. All variants are approximations that collapse to Equation (4).…”
Section: With a Single Processor (citation type: mentioning)
confidence: 99%
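
The statement above can be illustrated with a minimal sketch. It assumes that Equation (4) in the citing paper is Young's first-order formula in its usual form, T_opt = sqrt(2 µ C), and that the quoted variant is read as sqrt(2(µ + R)C), with µ the platform MTBF, C the checkpoint cost, and R the recovery cost. The function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def young_period(mtbf: float, checkpoint_cost: float) -> float:
    """First-order optimal checkpointing period: sqrt(2 * mu * C)."""
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def period_with_recovery(mtbf: float, checkpoint_cost: float,
                         recovery_cost: float) -> float:
    """Variant that adds the recovery cost R to mu: sqrt(2 * (mu + R) * C)."""
    return math.sqrt(2.0 * (mtbf + recovery_cost) * checkpoint_cost)


# Assumed values, in seconds (not taken from the cited paper).
mu = 86_400.0  # platform MTBF of one day
C = 600.0      # checkpoint cost
R = 600.0      # recovery cost

t_young = young_period(mu, C)
t_variant = period_with_recovery(mu, C, R)

print(f"Young period:   {t_young:.1f} s")
print(f"Variant period: {t_variant:.1f} s")
# When R is small compared with mu, the two periods are nearly identical,
# which is one way to read the claim that the variants collapse to Equation (4).
print(f"Relative gap:   {(t_variant - t_young) / t_young:.4%}")
```

With these assumed values the two periods differ by well under one percent, because R is negligible next to µ.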
“…Also, Ni et al. [30] introduce process duplication to cope with both fail-stop and silent errors. Recently, Benoit et al. [4] extended this work to general applications and compared traditional process replication with group replication, where the whole application is replicated as a black box. They analyze several scenarios with duplication or triplication.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…More recently, a series of studies [128] has led to the conclusion that the MTBF depends mainly on the number of processors and is inversely proportional to the size of the system. Therefore, from the standpoint of resilience, scale is the great enemy [14]. Exascale systems are projected to contain on the order of tens or hundreds of millions of cores within the current decade; in fact, the supercomputer currently ranked third on the Top500 list (that is, as of November 2019, https://www.top500.org/list/2019/11/) has 10,649,600 cores (Sunway TaihuLight).…”
Section: Algunos Casos Reales (Some Real Cases) (citation type: unclassified)
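
The inverse proportionality described in the statement above can be sketched as follows. The sketch assumes independent, identically distributed failures with a constant rate per processor, so that the platform MTBF is the individual MTBF divided by the number of processors. The ten-year individual MTBF and the function name are hypothetical; only the core count of 10,649,600 comes from the quoted text.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def platform_mtbf(individual_mtbf: float, n_processors: int) -> float:
    """Platform MTBF under the assumed model: mu_ind / N."""
    if n_processors <= 0:
        raise ValueError("n_processors must be positive")
    return individual_mtbf / n_processors


# Assumed individual MTBF of ten years (illustrative, not from the source).
mu_ind = 10 * SECONDS_PER_YEAR

for n in (1_000, 100_000, 10_649_600):
    hours = platform_mtbf(mu_ind, n) / 3600
    print(f"N = {n:>10,}: platform MTBF = {hours:10.4f} hours")
```

Under this assumed model, multiplying the number of processors by a given factor divides the platform MTBF by the same factor.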
“…Because replication is currently performed at the process level, scale becomes an even more serious problem [14]. With millions of processors (and trillions of threads), the probability of errors during execution can become significant, depending on whether circuit manufacturers significantly increase the protection of the logic, latches, flip-flops, and static arrays in processors.…”
Section: Algunos Casos Reales (Some Real Cases) (citation type: unclassified)