2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2018
DOI: 10.1109/ftxs.2018.00010

CPU Overheating Characterization in HPC Systems: A Case Study

Abstract: As supercomputers grow in size, the number of abnormal events also increases. Some of these events might lead to an application failure. Others might simply impact the system efficiency. CPU overheating is one such event that decreases the system efficiency: when a CPU overheats, it reduces its frequency. This paper studies the problem of CPU overheating in supercomputers. In a first part, we analyze data collected over one year on a supercomputer of the TOP500 list to understand under which condi…

Citations: cited by 10 publications (9 citation statements)
References: 16 publications
“…Concerning the running applications, even if we identified specific applications that generate CPU overheating events, only a few of their runs are impacted by an overheating event. More details about the analyses of these two points can be found in our previous publication [8]. We conclude from these results that the occurrence of overheating events does not have a single cause.…”
Section: Analysis Of the Overheating Events; citation type: supporting
confidence: 54%
“…We study the occurrence of the CPU overheating events in a production supercomputer to identify their possible sources. The results presented in this paper are a summary of the main results that we described in a previous publication [8]. The analysis shows that there is no obvious correlation between the occurrence of overheating events and the load of the system, and that we cannot identify nodes that have a significantly higher probability of experiencing overheating events or applications that would increase the probability of overheating.…”
Section: Introduction; citation type: mentioning
confidence: 65%