2021 IEEE International Conference on Cluster Computing (CLUSTER) 2021
DOI: 10.1109/cluster48925.2021.00086

A Conceptual Framework for HPC Operational Data Analytics

Cited by 13 publications (9 citation statements)
References 53 publications
“…Recently, several works have been proposed to extend the power monitoring attainable from the voltage regulator modules leveraging shunt resistors, current probes, and out-of-band telemetry [23]. In addition, Operational Data Analytics [26] (ODA) has been introduced focusing on monitoring and managing large scale HPC installations. In this area, vertical solutions encompassing all layers (from data gathering and storage to processing and analysis) have been proposed.…”
Section: Related Work (mentioning)
confidence: 99%
“…Monitoring the health of all those subsystems is an increasingly daunting task for system administrators. To simplify this monitoring task and reduce the time between anomaly insurgency and response by the administrators, automatic anomaly detection systems have been introduced in recent years [3].…”
Section: Introduction (mentioning)
confidence: 99%
“…Modern supercomputers are endowed with monitoring systems that give the system administrators a holistic view of the system [3]. Data collected by these monitoring systems and historical data describing system availability are the basis for Machine Learning anomaly detection approaches [6,7,8,9,10], which build data-driven models of the supercomputer and its computing nodes.…”
Section: Introduction (mentioning)
confidence: 99%