2020 IEEE International Conference on Cluster Computing (CLUSTER) 2020
DOI: 10.1109/cluster49012.2020.00071

Global Experiences with HPC Operational Data Measurement, Collection and Analysis

Cited by 19 publications (11 citation statements)
References 16 publications
“…To answer these research questions, in this manuscript we extend Examon, a state-of-the-art ODA framework [13], [15], deployed on the CINECA data center, to integrate Nagios [16] monitored events. We are the first in exploring the possibility to use Nagios as an annotation tool (to provide normal and faulty state labels).…”
Section: A Contributions (mentioning)
confidence: 99%
“…As the size of supercomputing systems approaches the exascale, it is common to adopt Operational Data measurement, collection and Analysis (ODA) frameworks [13] to continuously monitor system information data (mostly in the form of multivariate time series data), such as data coming from physical sensors' telemetry (temperature, power), micro-architectural events (IPC, cache misses), data coming from the computing resources and facility [13]-[15]. These do not contain the records of node's and system failure events.…”
mentioning
confidence: 99%
“…This often leads to multiple tools being used, in turn resulting in complex and fragmented software stacks and in a wide disarray of monitoring data that is difficult to use effectively [22]. This statement is confirmed in a survey conducted by the Energy Efficient HPC Working Group (EEHPCWG) in 2019 [39] regarding the use of ODA in several HPC centers: most sites employ varying sets of monitoring, storage and analysis solutions that either rely on in-house systems, or on commercial products that are not tailored for data center monitoring, thus restricting administrators to simple ODAV visual inspection.…”
Section: State of the Art (mentioning)
confidence: 99%
“…Each of these individual steps has been explored thoroughly in the ODA research field, but there is currently a severe lack of end-to-end experiences, from design down to maintenance, covering all aspects of the ODA pipeline and providing the necessary insights and solutions to propel forward the capillary adoption of ODA in production data center environments. It has been observed in the literature, in fact, that most HPC centers rely on insular ODA solutions tackling only specific aspects of the problem [39], with no clear applicability to other domains.…”
Section: Introduction (mentioning)
confidence: 99%
“…If we keep track of application executions and recognize that a job executes a known application, we can: (a) make predictions about resource usage based on executions in the past (improving job scheduling [14] and predicting energy consumption [12]), (b) detect deviations from past resource usage (indicating anomalies and potential errors), (c) detect resource usage of known malicious applications (e.g. cryptocurrency mining [5]), and (d) lower power consumption by reducing CPU frequency for memory-bound applications [10].…”
Section: Introduction (mentioning)
confidence: 99%