Summary
As supercomputers grow in size, the number of abnormal events also increases. CPU overheating is one such event and degrades system efficiency: when a CPU overheats, it reduces its frequency. This paper presents a machine learning solution to predict such events. The proposed algorithm relies on dynamic time warping for feature extraction and on a machine learning algorithm for classification. It predicts overheating events solely from the trends of the CPU temperatures, tolerates very low temperature sampling rates, and has a negligible computational cost in practice. Our evaluation, using data from a production supercomputer, shows that the proposed solution makes accurate predictions a few minutes in advance. Furthermore, considering two simple preventive actions against CPU overheating, we present an analytical study showing that our predictive solution is accurate enough to significantly reduce the cost of overheating events.
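To illustrate the kind of feature extraction the abstract describes, the following is a minimal sketch of dynamic time warping applied to a temperature window: the DTW distance between a recent window of CPU temperature samples and a reference rising-temperature pattern can serve as one input feature for a classifier. The reference pattern, the sample values, and all names here are illustrative assumptions, not the paper's actual method or data.

```python
def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance
    between two numeric sequences of possibly different lengths."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = DTW distance between prefixes a[:i] and b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a sample in b
                                 cost[i][j - 1],      # skip a sample in a
                                 cost[i - 1][j - 1])  # align both samples
    return cost[n][m]

# Assumed reference trend: a steady rise toward the thermal limit (degrees C).
reference_rise = [60.0, 62.0, 65.0, 69.0, 74.0, 80.0]

# A recent window of coarsely sampled CPU temperatures (illustrative values).
window = [61.0, 64.0, 70.0, 75.0, 79.0]

# Small distance suggests the window resembles the pre-overheating pattern;
# this scalar could be one feature fed to the classifier.
feature = dtw_distance(window, reference_rise)
```

Because DTW aligns sequences elastically in time, it can match a rising trend even when the sampling rate is low or uneven, which is consistent with the robustness to low sampling rates claimed above.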