A Review of Supercomputer Performance Monitoring Systems

Stefanov, Konstantin; Pawar, Sucheta; Ranjan, Ashish; Wandhekar, Sanjay; Voevodin, Vladimir V.

doi:10.14529/jsfi210304

Cited by 2 publications

(2 citation statements)

References 30 publications

“…Monitoring RDMA networks requires specialized tools and techniques designed to capture and analyze RDMA-specific traffic and performance metrics. These tools provide insights into key parameters such as bandwidth utilization, latency, and congestion [19]. In the following, some of the known tools are discussed:…”

Section: HPC Monitoring Tools (mentioning)

confidence: 99%

In-Network Monitoring Strategies for HPC Cloud

Hemmatpour, Larsen, Kumar et al. 2024

Lecture Notes on Data Engineering and Communications Technologies

The optimized network architectures and interconnect technologies employed in high-performance cloud computing environments introduce challenges when it comes to developing monitoring solutions that effectively capture relevant network metrics. Moreover, network monitoring often involves capturing and analyzing a large volume of network traffic data. This process can introduce additional overhead and consume system resources, potentially impacting the overall performance of HPC applications. Balancing the need for monitoring with minimal disruption to application performance is a key challenge. In this paper, we study different strategies to enable a low-overhead monitoring system utilizing emerging programmable network devices. The research project was initiated at SRL and subsequently continued at UiT.

“…The scale of the problem motivates the development of automated procedures for anomaly detection and faulty node identification in current supercomputers and this need will become even more pressing for future Exascale systems [6]. The fact that most of today's HPC computing systems are endowed with monitoring infrastructures [7] that gather data from software (SW) and hardware (HW) components can be of great help toward the development of data-driven automated approaches. Historically, system management was performed through hand-crafted scripts and direct intervention of system administrators; most of the data is stored in log files, and anomalies are investigated a posteriori to find the source of reported problems (e.g., when many users recognize the failure and report it to administrators).…”

Section: Introduction (mentioning)

confidence: 99%

Analysing Supercomputer Nodes Behaviour with the Latent Representation of Deep Learning Models

Molan, Borghesi, Benini et al. 2022

Euro-Par 2022: Parallel Processing

Anomaly detection systems are vital in ensuring the availability of modern High-Performance Computing (HPC) systems, where many components can fail or behave wrongly. Building a data-driven representation of the computing nodes can help with predictive maintenance and facility management. Luckily, most of the current supercomputers are endowed with monitoring frameworks that can build such representations in conjunction with Deep Learning (DL) models. In this work, we propose a novel semi-supervised DL approach based on autoencoder networks and clustering algorithms (applied to the latent representation) to build a digital twin of the computing nodes of the system. The DL model projects the node features into a lower-dimensional space. Then, clustering is applied to capture and reveal underlying, non-trivial correlations between the features. The extracted information provides valuable insights for system administrators and managers, such as anomaly detection and node classification based on their behaviour and operative conditions. We validated the approach on 240 nodes from the Marconi 100 system, a Tier-0 supercomputer located in CINECA (Italy), considering a 10-month period.
