Message Passing Interface (MPI) is a popular parallel programming model for scientific applications. Most high-performance MPI implementations use the Rendezvous protocol for efficient transfer of large messages. This protocol can be designed with either RDMA Write or RDMA Read; it is usually implemented with RDMA Write, which requires a two-way handshake between the sending and receiving processes. At the same time, to achieve low latency, MPI implementations often provide a polling-based progress engine. The two-way handshake requires the polling progress engine to discover multiple control messages, which in turn forces MPI applications to call into the MPI library to make progress. Compute- or I/O-intensive applications often cannot do so, and most communication progress is made only after the computation or I/O is over. This severely hampers computation/communication overlap, which can have a detrimental impact on overall application performance. In this paper, we propose several mechanisms that exploit RDMA Read and selective interrupt-based asynchronous progress to provide better computation/communication overlap on InfiniBand clusters. Our evaluations reveal that nearly complete computation/communication overlap is achievable with our RDMA Read with Interrupt based protocol. Additionally, our schemes yield around 50% better communication progress when computation is overlapped with communication. Further, our application evaluation with Linpack (HPL) and NAS-SP (Class C) reveals that MPI_Wait time is reduced by around 30% and 28%, respectively, on a 32-node InfiniBand cluster. The gains in MPI_Wait time grow as the system size increases, indicating that our designs have a strong positive impact on the scalability of parallel applications.
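To make the overlap problem concrete, the following is a minimal sketch (an illustration, not the paper's implementation) of the nonblocking send/compute/wait pattern whose progress the proposed RDMA Read and interrupt-based designs improve; the message size and the compute loop are placeholder assumptions.

```c
/* Sketch: overlap pattern targeted by the paper. With a purely
 * polling-based RDMA-Write rendezvous, the handshake control messages
 * are only discovered inside MPI calls, so the transfer stalls during
 * compute(); RDMA Read with interrupt-driven progress lets the
 * transfer complete in the background. */
#include <mpi.h>

static void compute(void) {
    /* placeholder for an application compute phase */
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; ++i) x += 1e-9;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N = 1 << 20 };                /* large message -> rendezvous */
    static double buf[N];
    MPI_Request req;

    if (rank == 0) {
        MPI_Isend(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        compute();                       /* ideally overlapped transfer */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Irecv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
        compute();
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```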
Data-parallel architectures such as General Purpose Graphics Processing Units (GPGPUs) have seen a tremendous rise in their use for High End Computing. However, data movement into and out of GPGPUs remains the biggest hurdle to overall performance and programmer productivity. Applications executing on a cluster with GPUs have to manage data movement using CUDA in addition to MPI, the de facto parallel programming standard. Currently, data movement with the CUDA and MPI libraries is not integrated and is not as efficient as it could be. In addition, MPI-2 one-sided communication does not work for windows in GPU memory, as there is no way to remotely get or put data from GPU memory in a one-sided manner. In this paper, we propose a novel MPI design that integrates CUDA data movement transparently with MPI. The programmer is presented with a single MPI interface that can communicate to and from GPUs, and data movement between the GPU and the network can now be overlapped. The proposed design is incorporated into the MVAPICH2 library. To the best of our knowledge, this is the first work of its kind to enable advanced MPI features and optimized pipelining in a widely used MPI library. We observe up to 45% improvement in one-way latency. In addition, we show that collective communication performance can be improved significantly: 32%, 37%, and 30% for the Scatter, Gather, and Alltoall collective operations, respectively. Further, we enable MPI-2 one-sided communication with GPUs and observe up to 45% improvement for Put and Get operations.
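The pipelining idea can be illustrated with a small sender-side sketch (an illustration under assumptions, not the MVAPICH2 implementation): a large device buffer is moved in chunks, so the network send of one chunk overlaps the device-to-host copy of the next. The chunk size, chunk count, and two-buffer pipeline depth are assumptions chosen for clarity.

```c
/* Sketch: double-buffered GPU->host->network pipeline. While chunk i
 * travels over the network via MPI_Isend, chunk i+1 is copied from
 * device memory into the other pinned staging buffer. */
#include <mpi.h>
#include <cuda_runtime.h>

#define CHUNK   (1 << 20)   /* 1 MiB staging chunks (illustrative) */
#define NCHUNKS 16

void pipelined_send(const char *dptr, int peer, MPI_Comm comm) {
    char *stage[2];
    cudaStream_t stream[2];
    MPI_Request req[2] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL };

    for (int b = 0; b < 2; ++b) {
        cudaMallocHost((void **)&stage[b], CHUNK);  /* pinned host memory */
        cudaStreamCreate(&stream[b]);
    }
    for (int i = 0; i < NCHUNKS; ++i) {
        int b = i & 1;
        /* reuse a staging buffer only after its previous send completes */
        MPI_Wait(&req[b], MPI_STATUS_IGNORE);
        cudaMemcpyAsync(stage[b], dptr + (size_t)i * CHUNK, CHUNK,
                        cudaMemcpyDeviceToHost, stream[b]);
        cudaStreamSynchronize(stream[b]);   /* chunk i now on the host */
        /* send chunk i; the D2H copy of chunk i+1 overlaps with it */
        MPI_Isend(stage[b], CHUNK, MPI_BYTE, peer, i, comm, &req[b]);
    }
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    for (int b = 0; b < 2; ++b) {
        cudaStreamDestroy(stream[b]);
        cudaFreeHost(stage[b]);
    }
}
```

A receiver would mirror this loop with MPI_Irecv into staging buffers followed by host-to-device cudaMemcpyAsync calls; moving this machinery inside the MPI library is what lets the programmer keep a single MPI interface.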
Modern supercomputing systems have witnessed phenomenal growth in recent years owing to the advent of multi-core architectures and high-speed networks. However, the operational and maintenance costs of these systems have also grown rapidly. Techniques such as Dynamic Voltage and Frequency Scaling (DVFS) and CPU throttling have been proposed to conserve the power consumed by compute nodes during idle periods. However, it is also necessary to design software stacks in a power-aware manner to minimize the power drawn by the system during the execution of applications. It is equally critical to minimize the performance overheads associated with power-aware algorithms, as the benefits of saving power can be lost if the application runs longer. Modern multi-core architectures such as the Intel "Nehalem" allow DVFS and CPU throttling operations to be performed with little overhead. In this paper, we explore how these features can be leveraged to design algorithms that deliver fine-grained power savings during the communication phases of parallel applications. We also propose a theoretical model to analyze the power consumption characteristics of communication operations. We use microbenchmarks and application benchmarks such as NAS and CPMD to measure the performance of our proposed algorithms and to demonstrate the potential for saving power with 32 and 64 processes. We observe about 8% improvement in the overall energy consumed by these applications, with little performance overhead.
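As an illustration of fine-grained DVFS around communication (a sketch under assumptions, not the paper's algorithm), a runtime could drop the core frequency just before a long blocking collective, where the CPU mostly waits on the network, and restore it afterwards. The sysfs interface shown requires Linux's cpufreq "userspace" governor and write permission, and the frequency values are hypothetical placeholders.

```c
/* Sketch: lower the calling core's frequency across a blocking
 * collective, then restore it. Requires the cpufreq "userspace"
 * governor; LOW_KHZ/HIGH_KHZ are hypothetical frequencies. */
#define _GNU_SOURCE         /* for sched_getcpu() */
#include <mpi.h>
#include <stdio.h>
#include <sched.h>

#define LOW_KHZ  1600000L   /* hypothetical low frequency  */
#define HIGH_KHZ 2933000L   /* hypothetical high frequency */

static void set_freq_khz(long khz) {
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed",
             sched_getcpu());
    FILE *f = fopen(path, "w");
    if (f) { fprintf(f, "%ld\n", khz); fclose(f); }
}

/* power-aware wrapper around an all-to-all exchange */
void powersave_alltoall(void *sbuf, void *rbuf, int count, MPI_Comm comm) {
    set_freq_khz(LOW_KHZ);   /* CPU mostly waits on the network here */
    MPI_Alltoall(sbuf, count, MPI_BYTE, rbuf, count, MPI_BYTE, comm);
    set_freq_khz(HIGH_KHZ);  /* restore full speed for compute */
}
```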