E2EWatch: An End-to-End Anomaly Diagnosis Framework for Production HPC Systems

Aksar, Burak; Schwaller, Benjamin; Aaziz, Omar; Leung, Vitus J.; Brandt, James M.; Egele, Manuel; Coskun, Ayse K.

doi:10.1007/978-3-030-85665-6_5

Cited by 8 publications

(3 citation statements)

References 32 publications


“…This data can then be used in a supervised learning task directly or after processing new features (feature construction). Examples of this approach are [45,17,46] where authors use supervised ML approaches to classify the performance variations and job-level faults in HPC systems. For fault detection, [8,18] propose a supervised approach based on Random Forest (an ensemble method based on decision trees) to classify faults in an HPC system.…”

Section: Related Work (mentioning)

confidence: 99%


RUAD: Unsupervised Anomaly Detection in HPC Systems

Molan, Borghesi, Cesarini, et al. 2022

SSRN Journal



“…[Table] Columns: Tabular data, Time series. Supervised: [49,9], [47,48,10]. Semi-supervised: [5,6,43,22]. Unsupervised: [19,20], [21]. The novelty of this paper is, in relation to the existing works, threefold:…”

Section: Related Work (mentioning)

confidence: 99%

RUAD: Unsupervised Anomaly Detection in HPC Systems

Molan, Borghesi, Cesarini, et al. 2022

SSRN Journal

“…Since anomalies in HPC systems are rare events, the problem of anomaly detection cannot be treated as a classical supervised learning problem [17,21]; the majority of works that treat it in a fully supervised fashion have been tested using synthetic [14,22] or injected anomalies [15]. Instead of learning the properties of both relevant classes, the standard approach is to learn just the properties of the system's normal operation - anything deviating from this normal operation is then recognized as an anomaly.…”

Section: Related Work (mentioning)

confidence: 99%
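
The "learn only normal operation" idea in the statement above can be illustrated with a minimal sketch. This is not the method of any cited paper: it assumes a simple per-feature mean and standard deviation profile as a stand-in for a learned model, and the function names, sample values, and the threshold of 3.0 are all hypothetical.

```python
from statistics import mean, pstdev

def fit_normal_profile(normal_samples):
    """Learn per-feature mean and standard deviation from normal-only data."""
    columns = list(zip(*normal_samples))
    means = [mean(col) for col in columns]
    # Guard against zero spread so the z-score below stays finite.
    stds = [pstdev(col) or 1e-9 for col in columns]
    return means, stds

def anomaly_score(sample, profile):
    """Largest absolute z-score of the sample across all features."""
    means, stds = profile
    return max(abs(x - m) / s for x, m, s in zip(sample, means, stds))

def is_anomalous(sample, profile, threshold=3.0):
    """Flag a sample that deviates from normal operation (threshold is an assumption)."""
    return anomaly_score(sample, profile) > threshold

# Hypothetical node readings: (temperature, CPU load), all from normal operation.
normal = [[50.0, 0.30], [52.0, 0.32], [48.0, 0.28], [51.0, 0.31], [49.0, 0.29]]
profile = fit_normal_profile(normal)
typical_flag = is_anomalous([50.5, 0.30], profile)
overheated_flag = is_anomalous([90.0, 0.30], profile)
```

No anomalous examples are needed to fit the profile, which is the point of the approach when anomalies are rare.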

Analysing Supercomputer Nodes Behaviour with the Latent Representation of Deep Learning Models

Molan, Borghesi, Benini, et al. 2022

Euro-Par 2022: Parallel Processing


Anomaly detection systems are vital in ensuring the availability of modern High-Performance Computing (HPC) systems, where many components can fail or behave wrongly. Building a data-driven representation of the computing nodes can help with predictive maintenance and facility management. Luckily, most of the current supercomputers are endowed with monitoring frameworks that can build such representations in conjunction with Deep Learning (DL) models. In this work, we propose a novel semi-supervised DL approach based on autoencoder networks and clustering algorithms (applied to the latent representation) to build a digital twin of the computing nodes of the system. The DL model projects the node features into a lower-dimensional space. Then, clustering is applied to capture and reveal underlying, non-trivial correlations between the features. The extracted information provides valuable insights for system administrators and managers, such as anomaly detection and node classification based on their behaviour and operative conditions. We validated the approach on 240 nodes from the Marconi 100 system, a Tier-0 supercomputer located in CINECA (Italy), considering a 10-month period.
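
The two-stage pipeline described in the abstract (project node features into a lower-dimensional space, then cluster the latent points) can be sketched roughly as follows. This is only an illustration under stated assumptions: a fixed linear projection stands in for the trained autoencoder's encoder, plain k-means stands in for whichever clustering algorithm the authors used, and the weights, node features, and function names are hypothetical.

```python
import math

def encode(features, weights):
    """Stand-in for a trained encoder: a fixed linear projection to the latent space."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means, initialised deterministically from the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical 4-feature readings for four nodes, projected to 2 latent dimensions.
weights = [[0.5, 0.5, 0.0, 0.0],
           [0.0, 0.0, 0.5, 0.5]]
nodes = [[0.9, 0.8, 0.1, 0.2],
         [0.1, 0.2, 0.9, 0.8],
         [0.8, 0.9, 0.2, 0.1],
         [0.2, 0.1, 0.8, 0.9]]
latent = [encode(node, weights) for node in nodes]
labels, centroids = kmeans(latent, k=2)
```

Nodes that end up in the same cluster share a behaviour profile in the latent space; in a real system the projection would be learned from monitoring data rather than fixed by hand.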
