2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS) 2019
DOI: 10.1109/aicas.2019.8771527

Online Anomaly Detection in HPC Systems

Abstract: Reliability is a cumbersome problem in the evolution of High Performance Computing systems and data centers. During operation, several types of fault conditions or anomalies can arise, ranging from malfunctioning hardware to improper configurations or imperfect software. Currently, system administrators and end users have to discover them manually. Clearly, this approach does not scale to large supercomputers and facilities: automated methods to detect faults and unhealthy conditions are needed. Our method uses a …

Cited by 35 publications (36 citation statements)
References 19 publications
“…In this direction, different methods were proposed to improve the technique, not only for MD, but also more in general for AD. In the context of DCs, works in [28]-[30], [32] showed how to use performance counters to detect anomalies, while [33] showed how to use them for detecting covert cryptocurrency mining. However, very recent works in [34], [35] bring into question the robustness of this technique for security, carrying out a study with a big dataset with more than ninety malware and reporting poor results in accuracy detection.…”
Section: Related Work (mentioning)
confidence: 99%
“…For the ML inference phase, we use an AE, which is a particular kind of Neural Network suitable for Anomaly Detection [30], [42]. As described in Section II, the idea is to train a model on "healthy" activity of the DC (i.e., with no malware involved), and try to identify possible anomalies in these signatures when a malware is running in background.…”
Section: B Edge Malware Detection Inference (mentioning)
confidence: 99%
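
The idea quoted above (fit a model on "healthy" activity only, then flag inputs it reconstructs poorly) can be illustrated with a minimal sketch. This is not the cited authors' model: it assumes a tiny tied-weight linear autoencoder with a single latent unit, two hypothetical standardized metrics as input, and a mean-plus-three-standard-deviations threshold on the training reconstruction error. All names, data, and parameters below are illustrative assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def make_healthy(n):
    """Hypothetical 'healthy' samples: two standardized, strongly correlated metrics."""
    samples = []
    for _ in range(n):
        a = random.uniform(-1.0, 1.0)
        samples.append((a, 2.0 * a + random.gauss(0.0, 0.05)))
    return samples


def reconstruction_error(w, x):
    """Encode x to one latent value z = w.x, decode it as z * w, return squared error."""
    z = w[0] * x[0] + w[1] * x[1]
    return (x[0] - z * w[0]) ** 2 + (x[1] - z * w[1]) ** 2


def train(samples, lr=0.05, epochs=300):
    """Full-batch gradient descent on the mean squared reconstruction error."""
    w = [0.5, 0.5]  # arbitrary non-zero starting weights (assumption)
    n = len(samples)
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for x in samples:
            z = w[0] * x[0] + w[1] * x[1]              # latent code
            r = (x[0] - z * w[0], x[1] - z * w[1])     # residual x - decode(z)
            wr = w[0] * r[0] + w[1] * r[1]
            # Gradient of ||x - (w.x) w||^2 with respect to w.
            grad[0] += -2.0 * (z * r[0] + wr * x[0]) / n
            grad[1] += -2.0 * (z * r[1] + wr * x[1]) / n
        w[0] -= lr * grad[0]
        w[1] -= lr * grad[1]
    return w


# Train on healthy data only, then derive a threshold from the training errors.
healthy = make_healthy(200)
weights = train(healthy)
train_errors = [reconstruction_error(weights, x) for x in healthy]
threshold = statistics.mean(train_errors) + 3.0 * statistics.stdev(train_errors)


def is_anomalous(x):
    """Flag a sample whose reconstruction error exceeds the healthy-data threshold."""
    return reconstruction_error(weights, x) > threshold


print(is_anomalous((0.5, 1.0)))    # sample that follows the healthy correlation
print(is_anomalous((1.0, -1.0)))   # sample that breaks the healthy correlation
```

The threshold rule is a design choice made for this sketch; a percentile of the training errors or a validation set with known anomalies are common alternatives.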
“…Processing streaming data generated from HPC systems is challenging as a result of the large volumes and generating speed. An online supervised learning method is proposed in [49] to operate with live streamed data; Another online anomaly detection method using autoencoder is provided in [50]. To better support streaming logs analysis, a visual analytic framework is proposed in [51], which consists of data management, analysis and interactive visualization.…”
Section: Machine Learning and HPC Security (mentioning)
confidence: 99%
“…With the fast growing of their size and complexity, modern information technology systems contain numerous sources of potential faults and vulnerabilities [3,6,8]. The failure of their services can have significant consequences, ranging from degraded user's experience [11] to important financial losses [15].…”
Section: Introduction (mentioning)
confidence: 99%
“…For instance, a late appearance of logs in a sequence may indicate a performance anomaly corresponding to an abnormal temporal irregularity in a service response. Hence, log anomaly detection is recognized as an efficient mean to perform system anomaly detection [3]. Manually analyzing large and complex log datasets represents a cumbersome and error-prone task [3,8], justifying the need for automated data-driven solutions.…”
Section: Introduction (mentioning)
confidence: 99%