2021
DOI: 10.14529/jsfi210304

A Review of Supercomputer Performance Monitoring Systems

Abstract: High Performance Computing (HPC) is now one of the emerging fields in computer science and its applications. Top HPC facilities, supercomputers, offer great opportunities for modeling diverse processes, allowing more and greater products to be created without full-scale experiments. Current supercomputers and the applications running on them are very complex and are therefore hard to use efficiently. Performance monitoring systems are tools that help to understand the efficiency of supercomputing applications and overall superco…

Cited by 2 publications (2 citation statements); references 30 publications.
“…Monitoring RDMA networks requires specialized tools and techniques designed to capture and analyze RDMA-specific traffic and performance metrics. These tools provide insights into key parameters such as bandwidth utilization, latency, and congestion [19]. In the following, some of the known tools are discussed:…”
Section: HPC Monitoring Tools; citation type: mentioning (confidence 99%)
“…The scale of the problem motivates the development of automated procedures for anomaly detection and faulty node identification in current supercomputers and this need will become even more pressing for future Exascale systems [6]. The fact that most of today's HPC computing systems are endowed with monitoring infrastructures [7] that gather data from software (SW) and hardware (HW) components can be of great help toward the development of data-driven automated approaches. Historically, system management was performed through hand-crafted scripts and direct intervention of system administrators; most of the data is stored in log files, and anomalies are investigated a posteriori to find the source of reported problems (e.g., when many users recognize the failure and report it to administrators).…”
Section: Introduction; citation type: mentioning (confidence 99%)