2018
DOI: 10.1109/access.2018.2882394

A Lightweight and Flexible Tool for Distinguishing Between Hardware Malfunctions and Program Bugs in Debugging Large-Scale Programs

Cited by 5 publications (3 citation statements)
References 24 publications
“…However, due to the loss of data in the faulty equipment, this method often works together with the checkpoint method. Related research has also tried to improve the efficiency of error diagnosis [26]-[28], such as by using a daemon process [25]. Many of these methods rely on the MPI environment [32]-[34].…”
Section: Related Work (mentioning)
confidence: 99%
“…Among them, the typical reliability models include series systems model [17], parallel systems model [18], series parallel‐series systems model [19], cold storage systems model [20], hot standby systems model [21], and so on. In hardware reliability model, scholars built mathematical models for hardware reliability mainly through the following indicators: the reliability of products [22], availability [23], mean time to failure [24], mean time to first failure [25], fault frequency [26], mean up‐time or mean time between failure [27], mean time between repair [28], mean down time [29,30], and so on. The main research methods include extreme learning machine [31], dynamic optimization [32], SVM [33,34], adaptive neuro‐fuzzy [35,36], and so on.…”
Section: Introduction (mentioning)
confidence: 99%
“…Similar ideas have subsequently been adopted in other works. For example, [151] proposes that, because MPI applications usually perform frequent message-exchange operations, those messages can be treated as "heartbeats", so that if a specific process performs no message-passing operations for a considerable length of time, an error is suspected. The overhead can vary significantly [92].…”
unclassified
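
The heartbeat idea in the last statement above can be illustrated with a minimal sketch. This is not the cited tool's implementation: the class name `MessageHeartbeatMonitor`, its methods, and the `timeout_s` parameter are hypothetical, and the sketch assumes that some wrapper around the application's message-passing calls reports each operation to the monitor. No real MPI calls are made here.

```python
import time


class MessageHeartbeatMonitor:
    """Hypothetical monitor: each observed message operation counts as a heartbeat."""

    def __init__(self, ranks, timeout_s, clock=time.monotonic):
        # timeout_s is an assumed tuning parameter: how long a process may stay
        # silent before it is suspected of having failed.
        self.timeout_s = timeout_s
        self.clock = clock
        now = clock()
        # Every process starts with a fresh heartbeat at construction time.
        self.last_seen = {rank: now for rank in ranks}

    def record_message(self, rank):
        """Call whenever the given process sends or receives a message."""
        self.last_seen[rank] = self.clock()

    def suspected(self):
        """Return the sorted ranks that have been silent for longer than timeout_s."""
        now = self.clock()
        return sorted(
            rank
            for rank, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A process flagged by `suspected()` is only a candidate: a long computation phase without communication also looks silent, so the timeout would need to be chosen with the application's communication pattern in mind.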