Abstract—Network congestion is one of the primary causes of performance degradation, performance variability, and poor scaling in communication-heavy parallel applications. However, the causes and mechanisms of network congestion on modern interconnection networks are not well understood. We need new approaches to analyze, model, and predict this critical behavior in order to improve the performance of large-scale parallel applications. This paper applies supervised learning algorithms, such as forests of extremely randomized trees and gradient boosted regression trees, to perform regression analysis on communication data and application execution time. Using data derived from multiple executions, we create models to predict the execution time of communication-heavy parallel applications. This analysis also identifies the features and associated hardware components that have the most impact on network congestion and, in turn, on execution time. The ideas presented in this paper have wide applicability: predicting the execution time on a different number of cores, with different input datasets, or even for an unknown code; identifying the best configuration parameters for an application; and finding the root causes of network congestion on different architectures.
I. MOTIVATION AND IMPACT

Network congestion is widely recognized as one of the primary causes of performance degradation, performance variability, and poor scaling in communication-heavy applications running on supercomputers [5]. However, due to the complex nature of interconnection networks, as well as of message injection and routing strategies, network congestion and its root causes in network resources and hardware components are not well understood. This makes the problem of mitigating and avoiding network congestion difficult. It also complicates the task of writing congestion-avoiding and congestion-minimizing algorithms for communication and task mapping. Therefore, we need new approaches to understand and model network congestion in order to improve the performance of large-scale parallel applications.

When a message is sent from one node to another, it is split into packets that pass through many resources and hardware components on the network. A packet starts in an injection FIFO on the source node. It then passes through multiple network links and receive buffers on intermediate nodes before it finally lands in the reception FIFO on the destination. When shared by multiple packets, any or all of these network components can slow down individual flits, packets, and messages. This paper aims to identify the hardware components that most affect the performance of sending a message.

Our approach is based on using supervised machine learning to build models that map from independent variables, representing different network hardware components, to a dependent variable: the execution time of the application. We only consider computationally balanced, communication-heavy parallel applications and, hence, focus on the communication fraction of the total execution time.
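To make this setup concrete, the sketch below shows one way such a regression could be assembled with scikit-learn, using the two ensemble methods named above. This is a minimal illustration under stated assumptions, not the paper's actual pipeline: the feature names standing in for per-component network counters (injection FIFO occupancy, link utilization, and so on) and the synthetic data are hypothetical placeholders.

```python
# Minimal sketch of the regression analysis described above (assumes
# scikit-learn; feature names and data are hypothetical stand-ins for
# per-execution network hardware counters).
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Hypothetical independent variables: one column per network component.
feature_names = ["inj_fifo_occ", "link_util", "recv_buf_occ", "recep_fifo_occ"]
rng = np.random.default_rng(0)
X = rng.random((200, len(feature_names)))                # stand-in counters
y = 1.0 + 3.0 * X[:, 1] + 0.5 * X[:, 0] \
    + rng.normal(0.0, 0.05, 200)                         # stand-in exec. times

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (ExtraTreesRegressor(n_estimators=100, random_state=0),
              GradientBoostingRegressor(n_estimators=100, random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__,
          "R^2 on held-out runs:", round(model.score(X_test, y_test), 3))
    # Relative importance of each hardware-component feature: this is the
    # mechanism by which such models can rank the components that most
    # affect execution time.
    for name, importance in zip(feature_names, model.feature_importances_):
        print(f"  {name}: {importance:.3f}")
```

In this style of analysis, the per-feature importances reported by the fitted ensembles are what allow the model to go beyond prediction and attribute execution-time variation to specific network components.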