A Lightweight and Flexible Tool for Distinguishing Between Hardware Malfunctions and Program Bugs in Debugging Large-Scale Programs

Zhang, Guozhen; Liu, Yi; Yang, Hailong; Qian, Depei

doi:10.1109/access.2018.2882394

Cited by 5 publications

(3 citation statements)

References 24 publications


“…However, due to the loss of data in the faulty equipment, this method often works together with the checkpoint method. Related research has also tried to improve the efficiency of error diagnosis [26]-[28], such as by using a daemon process [25]. Many of these methods rely on the MPI environment [32]-[34].…”

Section: Related Work

mentioning

confidence: 99%

FT-PBLAS: PBLAS-Based Fault-Tolerant Linear Algebra Computation on High-performance Computing Systems

2020

Self Cite


As high-performance computing (HPC) systems have scaled up, resilience has become a great challenge. To guarantee resilience, various kinds of hardware and software techniques have been proposed. However, among popular software fault-tolerant techniques, both the checkpoint-restart approach and the replication technique face challenges of scalability in the era of peta- and exa-scale systems due to their numerous processes. In this situation, algorithm-based approaches, or algorithm-based fault tolerance (ABFT) mechanisms, have become attractive because they are efficient and lightweight. Although the ABFT technique is algorithm-dependent, it is possible to implement it at a low level (e.g., in libraries for basic numerical algorithms) and make it application-independent. However, previous ABFT approaches have mainly aimed at achieving fault tolerance in integrated circuits (ICs) or at the architecture level and are therefore not suitable for HPC systems; e.g., they use checksums of rows and columns of matrices rather than checksums of blocks to detect errors. Furthermore, they cannot deal with errors caused by node failure, which are common in current HPC systems. To solve these problems, this paper proposes FT-PBLAS, a PBLAS-based library for fault-tolerant parallel linear algebra computations that can be regarded as a fault-tolerant version of the parallel basic linear algebra subprograms (PBLAS), because it provides a series of fault-tolerant versions of interfaces in PBLAS. To support the underlying error detection and recovery mechanisms in the library, we propose a block-checksum approach for non-fatal errors and a scheme for addressing node failure, respectively. We evaluate two fault-tolerant mechanisms and FT-PBLAS on HPC systems, and the experimental results demonstrate the performance of our library.

Index terms: Algorithm-based fault tolerance, HPC systems, node failure, matrix multiplication, linear algebra computations.
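The abstract contrasts row-and-column checksums with the block checksums used by FT-PBLAS but gives no detail of either. As a rough illustration of the classic row-and-column idea only (not the FT-PBLAS block scheme), the sketch below multiplies checksum-extended matrices and locates a single corrupted entry. All function names are hypothetical, and integer entries are assumed so that equality tests are exact; floating-point data would need a tolerance.

```python
def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def with_column_checksum(a):
    """Append one extra row holding the sum of each column of `a`."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one extra column holding the sum of each row of `b`."""
    return [row + [sum(row)] for row in b]


def find_error(c_full):
    """Locate one corrupted data entry of a checksum-extended product.

    Returns (row, col) when both a row checksum and a column checksum
    disagree with the data, otherwise None.
    """
    n_rows, n_cols = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n_rows)
                if sum(c_full[i][:n_cols]) != c_full[i][n_cols]]
    bad_cols = [j for j in range(n_cols)
                if sum(c_full[i][j] for i in range(n_rows)) != c_full[n_rows][j]]
    if bad_rows and bad_cols:
        return bad_rows[0], bad_cols[0]
    return None
```

Multiplying `with_column_checksum(a)` by `with_row_checksum(b)` yields the ordinary product with its own checksum row and column attached, so a later corruption of one data entry shows up as one inconsistent row and one inconsistent column.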



“…Among them, the typical reliability models include series systems model [17], parallel systems model [18], series parallel‐series systems model [19], cold storage systems model [20], hot standby systems model [21], and so on. In hardware reliability model, scholars built mathematical models for hardware reliability mainly through the following indicators: the reliability of products [22], availability [23], mean time to failure [24], mean time to first failure [25], fault frequency [26], mean up‐time or mean time between failure [27], mean time between repair [28], mean down time [29,30], and so on. The main research methods include extreme learning machine [31], dynamic optimization [32], SVM [33,34], adaptive neuro‐fuzzy [35,36], and so on.…”

Section: Introduction

mentioning

confidence: 99%

Chaotic neural network model for SMISs reliability prediction based on interdependent network SMISs reliability prediction by chaotic neural network

Zhu, Zhu-ping, Sun, et al. 2020

Quality & Reliability Eng


With the development of the industrial Internet, smart manufacturing information systems (SMISs) face more uncertainty, dynamics, and complexity, which makes their reliable operation more challenging. To address this problem, a prediction model based on phase space reconstruction, chaos analysis, and a back propagation (BP) neural network is proposed to predict SMISs reliability. First, we decompose the failure data series into sub-series components with strong regularity by using the C‐C algorithm and the Cao algorithm. On this basis, we use the maximum Lyapunov exponent to identify chaotic characteristics of the failure data series. We then establish a BP neural network prediction model on the reconstructed failure data to predict SMISs failure behaviors. Finally, we use two groups of failure data series to verify the effectiveness of the chaotic BP neural network model; the experimental results show that it predicts more accurately than a BP network, support vector machine, long short-term memory network (LSTM), and autoregressive model (AR).

Lead paragraph: SMISs are open and complex systems, and human error and the external environment make their reliability uncertain. The threat from the external environment comes mainly from malicious attacks, and the threat from human error comes mainly from incorrect operation by the operator. Human error and the external environment often cause frequent software and hardware failures of the system or physical failures of devices. Because SMISs are open complex systems, their reliable operation is very important for manufacturing enterprises. However, in the daily use of SMISs, the most common failures are time failures caused by human error and the external environment. Therefore, it is very important to study the time failure of SMISs.

Main points of this paper: (1) SMISs are complex and open systems, so we establish an interdependent network based on the characteristics of SMISs, and use the cascade effect of the complex network to point out that when SMISs fail, the system will easily fall into failure. (2) The phase space reconstruction method is used to restore the real data characteristics of the failure data. (3) By using the reconstructed data and the neural network, the failure behavior can be predicted accurately. A comparison with other popular prediction methods shows that general machine learning methods cannot predict data with chaotic characteristics. The results of this paper show that when SMISs fail, the failure behavior can easily lead SMISs into chaos through propagation in the interdependent network. Therefore, future fault analyses of SMISs should consider the chaos of the data; otherwise, system fault analysis and diagnosis cannot be carried out accurately.
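Phase space reconstruction is usually done by time-delay embedding. The sketch below shows only that embedding step, assuming the embedding dimension and delay are already known; in the paper these are obtained with the C-C and Cao algorithms, which are not reproduced here. The function name `delay_embed` and its parameters are hypothetical.

```python
def delay_embed(series, dim, tau):
    """Time-delay embedding of a scalar series.

    Vector i is [x[i], x[i + tau], ..., x[i + (dim - 1) * tau]].
    `dim` (embedding dimension) and `tau` (delay) are assumed given.
    """
    n_vectors = len(series) - (dim - 1) * tau
    if n_vectors <= 0:
        raise ValueError("series too short for this dim and tau")
    return [[series[i + k * tau] for k in range(dim)]
            for i in range(n_vectors)]
```

Each reconstructed vector could then serve as one input sample for a predictor such as a BP network, with the next value of the series as its target.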


“…Similar ideas have subsequently been adopted in other works. For example, in [151], it is proposed that, since the execution of MPI applications usually involves frequent message-exchange operations, those messages be treated as "heartbeats", so that if a specific process performs no message-passing operations for a considerable period of time, the occurrence of an error is suspected. the overhead can vary significantly [92].…”

unclassified
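The heartbeat idea in the statement above can be sketched as a small monitor that records when each process last exchanged a message and flags processes that have been silent for too long. This is a minimal illustration under stated assumptions, not the cited tool's implementation: the class name, the `timeout_s` parameter, and the injectable `clock` are all hypothetical, and no real MPI calls are made.

```python
import time


class HeartbeatMonitor:
    """Treats every observed message of a rank as a heartbeat."""

    def __init__(self, ranks, timeout_s, clock=time.monotonic):
        self._clock = clock          # injectable so tests can fake time
        self._timeout_s = timeout_s
        now = clock()
        self._last_seen = {rank: now for rank in ranks}

    def record_message(self, rank):
        """Call whenever `rank` sends or receives a message."""
        self._last_seen[rank] = self._clock()

    def suspects(self):
        """Return the sorted ranks silent for longer than the timeout."""
        now = self._clock()
        return sorted(rank for rank, seen in self._last_seen.items()
                      if now - seen > self._timeout_s)
```

A suspected rank is only a hint: a long compute phase without communication looks the same as a hang, so the timeout has to be chosen with the application's communication pattern in mind.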

SEDAR: Detección y recuperación automática de fallos transitorios en sistemas de cómputo de altas prestaciones

Montezanti


Fault handling is a growing concern in HPC; in the future, greater varieties and rates of errors, longer detection intervals, and silent faults are expected. In upcoming exascale systems, errors are projected to occur even several times a day and to propagate through large parallel applications, causing anything from process crashes to corrupted results due to undetected faults. This work proposes SEDAR, a methodology that improves the reliability, against transient faults, of a system running message-passing parallel applications. The designed solution, based on process replication for detection combined with different levels of checkpointing (system-level or application-level checkpoints) for automatic recovery, aims to help users of scientific applications obtain reliable executions with correct results. Detection is achieved by internally replicating each application process in threads and monitoring the contents of messages between the threads before sending them to another process; in addition, final results are validated to prevent corruption of the local computation. This strategy allows execution to be relaunched from the beginning as soon as detection occurs, without waiting unnecessarily for an incorrect conclusion. For recovery, system-level checkpoints are used, but because there is no guarantee that a particular checkpoint is free of latent silent errors, multiple checkpoints must be stored and maintained, and a mechanism is implemented to retry successive recoveries from earlier checkpoints if the same error is detected again. The last option is to use a single application-layer checkpoint, which can be verified to ensure its validity as a safe recovery point.

Consequently, SEDAR is structured in three levels: (1) detection only and safe stop with notification to the user; (2) recovery based on a chain of system-level checkpoints; and (3) recovery based on a single valid application-layer checkpoint. Each of these variants provides a particular coverage but has inherent limitations and its own implementation costs; the ability to choose among them provides flexibility to adapt the cost-benefit trade-off to the needs of a particular system. A complete description of the methodology is presented, along with its behavior in the presence of faults and the temporal overheads of each alternative. A model is described that considers several fault scenarios and their predictable effects on a test application in order to perform a functional verification. In addition, an experimental validation is carried out on a real implementation of the SEDAR tool, using different benchmarks with dissimilar communication patterns. The behavior in the presence of faults, injected in a controlled manner at different points of the execution, makes it possible to evaluate performance and characterize the overhead associated with its use. Taking this into account, the conditions are also established under which it is worthwhile to start with protection and store several checkpoints for recovery, rather than simply detecting, stopping the execution, and relaunching. The options for configuring the mode of use, adapting it to the coverage requirements and maximum permitted overhead of a particular system, show that SEDAR is an effective and viable methodology for tolerating transient faults in HPC environments.
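The second SEDAR level, recovery from a chain of system-level checkpoints, can be illustrated with a short retry loop: restart from the newest checkpoint and fall back to older ones while the error keeps being detected. This is only a sketch of the idea as the abstract describes it; the function name and the `rerun` callback are hypothetical and do not come from the SEDAR tool.

```python
def recover_from_chain(checkpoints, rerun):
    """Retry from successively older checkpoints until a rerun is clean.

    checkpoints: list ordered from oldest to newest.
    rerun: hypothetical callable(checkpoint) -> (error_detected, result).
    Returns (checkpoint_used, result), or (None, None) when every retry
    detected the error again and the caller must relaunch from the start.
    """
    for checkpoint in reversed(checkpoints):  # newest first
        error_detected, result = rerun(checkpoint)
        if not error_detected:
            return checkpoint, result
    return None, None
```

Keeping several checkpoints costs storage, but it covers the case where the most recent one already contains a latent silent error.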
