2018
DOI: 10.1186/s13173-018-0069-z

Running resilient MPI applications on a Dynamic Group of Recommended Processes

Abstract: High-performance computing systems run applications that can take several hours to execute and must cope with a potentially large number of faults. Most existing fault-tolerant strategies for these systems assume that crash faults are permanent events that are easily detected. This is not the case in several real systems, particularly shared clusters, in which even load variation may cause performance problems that are virtually equivalent to faults. In this work, we present a ne…

Citations: cited by 5 publications (1 citation statement)
References: 53 publications (103 reference statements)
“…Therefore, the diagnosis implicitly assumes the synchronous model. Although an earlier system-level diagnosis model assumed tests that are not perfect [Camargo and Duarte 2018], only in the recent work of ] are the diagnosis algorithms specified as failure detectors.…”
Section: Introdução (Introduction), unclassified