Graphics processing units (GPUs) have recently been widely used as accelerators in shared environments such as clusters and clouds. In such environments, kernels are submitted to GPUs by many different users, and throughput is an important metric for both performance and total cost of ownership. Despite recently improved runtime support for concurrent GPU kernel execution, the GPU can be severely underutilized, resulting in suboptimal throughput. In this paper, we propose Kernelet, a runtime system with dynamic slicing and scheduling techniques that improves the throughput of concurrent kernel executions on the GPU. With slicing, Kernelet divides a GPU kernel into multiple sub-kernels (called slices). Each slice has tunable occupancy, which allows it to be co-scheduled with other slices so as to fully utilize GPU resources. We develop a novel and effective Markov-chain-based performance model to guide the scheduling decision. Our experimental results demonstrate up to 31.1% and 23.4% performance improvement on NVIDIA Tesla C2050 and GTX680 GPUs, respectively.
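To make the slicing idea concrete, here is a minimal sketch (not Kernelet's actual implementation): a kernel is rewritten against a logical block index so that any contiguous range of its thread blocks can be launched as one slice, and slices from different kernels can then be interleaved on separate CUDA streams. The kernel, the helper `launchSliced`, and the slice size are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <algorithm>

// Kernel written against a "logical" block index: any contiguous
// range of the original grid can be launched as one slice by
// passing a block offset (illustrative, not Kernelet's code).
__global__ void vecAddSlice(const float *a, const float *b, float *c,
                            int n, int blockOffset) {
    int logicalBlock = blockIdx.x + blockOffset;  // position in the full grid
    int i = logicalBlock * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host side: issue the original grid as slices of sliceBlocks thread
// blocks each. Placing slices of different kernels on different
// streams lets the hardware co-schedule them.
void launchSliced(const float *a, const float *b, float *c, int n,
                  int sliceBlocks, cudaStream_t stream) {
    const int threads = 256;
    int totalBlocks = (n + threads - 1) / threads;
    for (int off = 0; off < totalBlocks; off += sliceBlocks) {
        int blocks = std::min(sliceBlocks, totalBlocks - off);
        vecAddSlice<<<blocks, threads, 0, stream>>>(a, b, c, n, off);
    }
}
```

Tuning `sliceBlocks` plays the role of the tunable occupancy described above: smaller slices leave more room for a co-scheduled slice of another kernel.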
This paper demonstrates Medusa, a programming framework for parallel graph processing on graphics processing units (GPUs). Medusa enables developers to leverage the massive parallelism and other hardware features of GPUs by writing sequential C/C++ code against a small set of APIs, which greatly simplifies the implementation of parallel graph processing on the GPU. The Medusa runtime system automatically executes the user-defined APIs in parallel on the GPU, applying a series of graph-centric optimizations based on GPU architectural features. We will demonstrate the steps of developing GPU-based graph processing algorithms with Medusa, and the superior performance of Medusa on both real-world and synthetic datasets.
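The following sketch illustrates this programming style under stated assumptions; the type names and the `edgeOp`/`runEdgeOp` split are hypothetical and do not reproduce Medusa's real API. The developer writes only the sequential per-edge logic, while a runtime-provided kernel maps it over the graph in parallel.

```cuda
#include <cuda_runtime.h>

// Illustrative CSR graph layout; names are assumptions, not Medusa's API.
struct Graph {
    int    n;        // number of vertices
    int   *rowPtr;   // CSR row offsets (n + 1 entries)
    int   *colIdx;   // destination vertex of each edge
    int   *outDeg;   // out-degree of each vertex
    float *rank;     // current PageRank value
    float *accum;    // accumulated contributions for the next iteration
};

// User-written sequential logic for a single edge (PageRank scatter).
__device__ void edgeOp(Graph g, int src, int dst) {
    atomicAdd(&g.accum[dst], g.rank[src] / g.outDeg[src]);
}

// Runtime-provided driver: one thread per vertex applies edgeOp to
// each outgoing edge.
__global__ void runEdgeOp(Graph g) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= g.n) return;
    for (int e = g.rowPtr[v]; e < g.rowPtr[v + 1]; ++e)
        edgeOp(g, v, g.colIdx[e]);
}
```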
This paper examines the performance of collective communication operations in the Message Passing Interface (MPI) in the cloud computing environment. Awareness of network topology has been a key factor in performance optimizations for existing MPI implementations. However, virtualization in the cloud not only hides the network topology from users, but also introduces traffic interference and dynamic variation in network performance. Existing topology-aware optimizations are therefore no longer feasible in the cloud environment. We develop novel network-performance-aware algorithms for a series of collective communication operations including broadcast, reduce, gather, and scatter. We further implement two common applications, N-body and conjugate gradient (CG). We conducted our experiments with two complementary methods (on Amazon EC2 and in simulation). Our results show that network performance awareness yields 25.4% and 28.3% performance improvement over MPICH2 on Amazon EC2 and in simulation, respectively. Evaluations on N-body and CG show application performance improvements of 41.6% and 14.3%, respectively.

Index Terms—Cloud Computing, MPI, Collective Operations, Network Performance Optimizations

INTRODUCTION

Cloud computing has emerged as a popular computing paradigm for many distributed and parallel applications. The Message Passing Interface (MPI) is a common and key software component in distributed and parallel applications, and its performance is the key factor in network communication efficiency. This paper investigates whether and how we can improve the performance of MPI in the cloud computing environment. Since collective communications are the most important MPI operations for system performance [13], [14], [17], this paper focuses on the efficiency of MPI collective communication operations. Network-topology-aware algorithms have been applied to optimize the performance of collective communication operations [13], [28], [26], [14], [17]. Most of these studies adopt tree-based algorithms, since the network topology is often tree-structured. The essential idea of those algorithms is to obtain the topology information with hardware …
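As a rough illustration of network performance awareness (a sketch under assumptions, not the paper's algorithms): the root can probe each peer's link with a short ping-pong and then order a broadcast by the measured round-trip times, rather than by a topology it cannot observe in a virtualized cloud. `probeRTT` and `perfAwareBcast` are hypothetical names; production code would average several probes with realistic message sizes.

```cuda
#include <mpi.h>
#include <vector>
#include <algorithm>
#include <numeric>

// One ping-pong round trip from the root to `peer` (a single 1-byte
// probe; real measurements would repeat this with larger messages).
static double probeRTT(int peer, MPI_Comm comm) {
    char byte = 0;
    double t0 = MPI_Wtime();
    MPI_Send(&byte, 1, MPI_CHAR, peer, 0, comm);
    MPI_Recv(&byte, 1, MPI_CHAR, peer, 0, comm, MPI_STATUS_IGNORE);
    return MPI_Wtime() - t0;
}

// Flat-tree broadcast from rank 0 that issues sends in order of
// measured link performance instead of assuming a known topology.
void perfAwareBcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == 0) {
        std::vector<int> peers(size - 1);
        std::iota(peers.begin(), peers.end(), 1);  // ranks 1..size-1
        std::vector<double> rtt(size, 0.0);
        for (int p : peers) rtt[p] = probeRTT(p, comm);
        std::sort(peers.begin(), peers.end(),
                  [&](int a, int b) { return rtt[a] < rtt[b]; });
        std::vector<MPI_Request> reqs(peers.size());
        for (size_t i = 0; i < peers.size(); ++i)
            MPI_Isend(buf, count, type, peers[i], 1, comm, &reqs[i]);
        MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);
    } else {
        char byte;
        MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, comm, MPI_STATUS_IGNORE); // probe
        MPI_Send(&byte, 1, MPI_CHAR, 0, 0, comm);                    // echo
        MPI_Recv(buf, count, type, 0, 1, comm, MPI_STATUS_IGNORE);   // payload
    }
}
```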
Modern GPUs have been widely used to accelerate graph processing for complex computational problems in graph theory. Many parallel graph algorithms adopt the asynchronous computing model to accelerate iterative convergence. Unfortunately, consistent asynchronous computing requires locking or atomic operations, which incur significant overhead when implemented on GPUs. To this end, coloring algorithms are adopted to separate vertices with potential update conflicts, guaranteeing the consistency and correctness of parallel processing. We propose a lightweight asynchronous processing framework called Frog with a hybrid coloring model. We find that the majority of vertices (about 80%) are colored with only a few colors, so they can be read and updated with a very high degree of parallelism without violating sequential consistency. Accordingly, our solution separates the processing of vertices based on the distribution of colors.
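A minimal sketch of color-step execution, assuming vertices have already been partitioned into per-color arrays (`verts` plus `start` offsets): within one color no two vertices are adjacent, so a kernel may update them lock-free while reading their differently colored neighbors. The update rule and all names here are placeholders, and the sketch processes every color the same way rather than reproducing Frog's hybrid handling of the rarely used colors.

```cuda
#include <cuda_runtime.h>

// Process all vertices of one color in parallel. Same-color vertices
// share no edge, so no locks or atomics are needed for the writes.
__global__ void colorStep(const int *colorVerts, int count,
                          const int *rowPtr, const int *colIdx,
                          float *value) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= count) return;
    int v = colorVerts[t];
    float acc = 0.0f;
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e)
        acc += value[colIdx[e]];   // neighbors all carry other colors
    value[v] = 0.85f * acc;        // placeholder update rule
}

// Host loop: colors run one after another; updates made by earlier
// colors are visible to later ones, giving asynchronous-style progress.
void runColoredIteration(const int *verts, const int *start, int numColors,
                         const int *rowPtr, const int *colIdx, float *value) {
    for (int c = 0; c < numColors; ++c) {
        int count = start[c + 1] - start[c];
        if (count > 0) {
            int blocks = (count + 255) / 256;
            colorStep<<<blocks, 256>>>(verts + start[c], count,
                                       rowPtr, colIdx, value);
        }
    }
}
```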