In this paper, we discuss the role, design and implementation of smart containers in the SkePU skeleton library for GPU-based systems. These containers provide an interface similar to C++ STL containers but internally perform runtime optimization of data transfers and runtime memory management for their operand data on the different memory units. We discuss how these containers help achieve asynchronous execution of skeleton calls while providing implicit synchronization capabilities in a data-consistent manner. Furthermore, we discuss the limitations of the original, already optimizing memory management mechanism implemented in SkePU containers, and propose and implement a new mechanism that provides stronger data consistency and improves performance by reducing communication and memory allocations. With several applications, we show that our new mechanism can achieve significantly (up to 33.4 times) better performance than the initial mechanism for page-locked memory on a multi-GPU based system.

Keywords SkePU · Smart containers · Skeleton programming · Memory management · Runtime optimizations · GPU-based systems

1 Introduction

Skeleton programming [4] for GPU-based systems is becoming increasingly popular for mapping common computational patterns. Several skeleton libraries have been written from scratch specifically targeting GPU-based systems, including SkePU [10, 6], SkelCL [24] and Marrow [20]. Moreover, many existing skeleton libraries initially written for execution on MPI clusters and/or multicore CPUs have been ported for GPU execution, such as FastFlow [12] and Muesli [11]. These libraries differ in their approach and feature offering, but they all aim to provide performance comparable to hand-written code while requiring much less programming effort.

Providing high-level abstraction with good execution performance in a library requires careful design. The question comes down to what is exposed to the programmer and what is handled implicitly by the skeleton library. For example, the Marrow library exposes concurrency to the application program by executing skeleton calls asynchronously; it returns a handle which can be used to synchronize execution when needed. This allows Marrow to effectively overlap computation and communication from different skeleton computations. SkelCL makes data distribution explicit so that the application programmer can choose how to map a computation to the underlying computing platform.

Another important aspect of GPU computation is managing communication between CPU (main) memory and GPU (device) memory over the PCIe interconnect. In Muesli, FastFlow, SkePU and SkelCL, skeleton calls can execute on a single-core or multicore CPU as well as on a GPU. Considering that CPUs and GPUs have separate physical memories, execution on a certain compute device may require transferring data back and forth to its associated memory if the data is not already available there. For example, in the following code,

    // 1D arrays: v0, v1
    skel_c...
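The excerpt breaks off here, but the transfer-management idea it introduces can be illustrated with a small sketch. The following is not SkePU's actual container implementation; it is a minimal, hypothetical illustration of lazy, on-demand transfers between host and device memory, assuming the CUDA runtime API (the class name SmartVector and its methods are ours):

    #include <cstddef>
    #include <cuda_runtime.h>

    // Hypothetical container (not SkePU's real implementation) that tracks
    // which memory unit holds a valid copy and transfers only on demand.
    class SmartVector {
        float* host_;            // CPU-side copy
        float* dev_ = nullptr;   // GPU-side copy, allocated on first use
        std::size_t n_;
        bool hostValid_ = true;  // is the CPU copy up to date?
        bool devValid_ = false;  // is the GPU copy up to date?
    public:
        explicit SmartVector(std::size_t n) : host_(new float[n]()), n_(n) {}
        ~SmartVector() { delete[] host_; if (dev_) cudaFree(dev_); }

        // Called before a GPU skeleton reads the operand: copy only if stale.
        float* gpuRead() {
            if (!dev_) cudaMalloc(reinterpret_cast<void**>(&dev_), n_ * sizeof(float));
            if (!devValid_) {
                cudaMemcpy(dev_, host_, n_ * sizeof(float), cudaMemcpyHostToDevice);
                devValid_ = true;
            }
            return dev_;
        }

        // Called when a GPU skeleton writes the operand: the CPU copy becomes
        // stale, but nothing is copied back yet (lazy write-back).
        float* gpuWrite() { gpuRead(); hostValid_ = false; return dev_; }

        // CPU-side access triggers the deferred device-to-host transfer;
        // this is the implicit synchronization point mentioned above.
        float& operator[](std::size_t i) {
            if (!hostValid_) {
                cudaMemcpy(host_, dev_, n_ * sizeof(float), cudaMemcpyDeviceToHost);
                hostValid_ = true;
            }
            devValid_ = false;   // conservatively assume the CPU may write
            return host_[i];
        }
    };

The point of the deferred write-back is that a sequence of GPU skeleton calls on the same container incurs no intermediate transfers; data returns to main memory only when the CPU actually accesses it.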
This thesis addresses issues associated with efficiently programming modern heterogeneous GPU-based systems, which contain multicore CPUs and one or more programmable Graphics Processing Units (GPUs). We use ideas from component-based programming to address programming, performance and portability issues of these heterogeneous systems. Specifically, we present three approaches that all use the idea of having multiple implementations for each computation; performance is achieved or retained either (a) by selecting a suitable implementation for each computation on a given platform, or (b) by dividing the computation work across different implementations running on CPU and GPU devices in parallel.

In the first approach, we work on a skeleton programming library (SkePU) that provides high-level abstraction while making intelligent implementation selection decisions underneath, either before or during the actual program execution. In the second approach, we develop a composition tool that parses extra information (metadata) from XML files, makes certain decisions offline, and, in the end, generates code for making the final decisions at runtime. The third approach is a framework that uses source-code annotations and program analysis to generate code for a runtime library that makes the selection decision at runtime. With a generic performance modeling API alongside program analysis capabilities, it supports online tuning as well as complex program transformations.

These approaches differ in terms of genericity, intrusiveness, capabilities and knowledge about the program source code; however, they all demonstrate the usefulness of component programming techniques for programming GPU-based systems. Through experimental evaluation, we demonstrate how all three approaches, although different in their own way, provide good performance on different GPU-based systems for a variety of applications.

This work has been supported by two EU FP7 projects (PEPPHER, EXCESS) and by SeRC.

Popular science summary

Making each generation of computers run faster is important for society's development and growth. Traditionally, most computers had only one general-purpose processor (the so-called CPU), which could execute only one computational task at a time. Over the past decade, however, multicore and manycore processors have become common, and computers have also become more heterogeneous. A modern computer system typically contains more than one CPU, together with special-purpose processors such as graphics processing units (GPUs) that are tailored to execute certain types of computations more efficiently than CPUs. We call such a system with one or more GPUs a GPU-based system. GPUs in such systems have their own separate memory, and to run a computation on a GPU one typically needs to move all input data to the GPU's memory and then fetch the result back when the computation is finished. Programming GPU-based systems is non-trivial for several reasons: (1) CPUs and GPUs require different programming exper...
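The excerpt is cut off here. To make the multi-variant idea shared by all three approaches concrete, here is a small sketch. It is not taken from SkePU, the composition tool, or the annotation framework; it is a generic, hypothetical illustration of keeping several implementation variants of one computation and selecting among them at call time based on context (here, only the problem size):

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Generic, hypothetical sketch of multi-variant dispatch; the names and
    // the size-threshold heuristic are ours, not from any of the frameworks.
    struct Variant {
        const char* name;
        std::size_t minSize;  // smallest problem size for which it pays off
        std::function<void(float*, std::size_t)> run;
    };

    void scale_cpu(float* d, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) d[i] *= 2.0f;
    }
    void scale_gpu(float* d, std::size_t n) {
        (void)d; (void)n;  // GPU offload elided in this sketch
    }

    // One computation, several implementations: the caller never picks one.
    void scale(float* d, std::size_t n) {
        static const std::vector<Variant> variants = {
            {"gpu", 1u << 20, scale_gpu},  // transfer cost amortized for large n
            {"cpu", 0,        scale_cpu},  // fallback, always applicable
        };
        for (const auto& v : variants)
            if (n >= v.minSize) { v.run(d, n); return; }
    }

The real selection logic in the three approaches is considerably richer, drawing on offline training, XML metadata, or program analysis rather than a fixed size threshold.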
Abstract: The PEPPHER component model defines an environment for the annotation of native C/C++-based components for homogeneous and heterogeneous multicore and manycore systems, including GPU and multi-GPU based systems. For the same computational functionality, captured as a component, different sequential and explicitly parallel implementation variants using various types of execution units may be provided, together with metadata such as explicitly exposed tunable parameters. The goal is to compose an application from its components and variants such that, depending on the run-time context, the most suitable implementation variant is chosen automatically for each invocation.

We describe and evaluate the PEPPHER composition tool, which explores the application's components and their implementation variants, generates the necessary low-level code that interacts with the runtime system, and coordinates the native compilation and linking of the various code units to compose the overall application code. With several applications, we demonstrate how the composition tool provides a high-level programming front-end while effectively utilizing the task-based PEPPHER runtime system (StarPU) underneath.
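To give a flavor of the runtime layer the composition tool targets: in StarPU, a computation becomes a codelet bundling its implementation variants, and data registered with the runtime is managed and transferred automatically. The sketch below is our own simplified example of that API (assuming StarPU 1.2 or later), not the code the composition tool actually generates:

    #include <starpu.h>
    #include <cstdint>

    // CPU implementation variant; a cuda_funcs entry would add a GPU variant.
    static void scale_cpu(void* buffers[], void* /*cl_arg*/) {
        float* v = reinterpret_cast<float*>(STARPU_VECTOR_GET_PTR(buffers[0]));
        std::uint32_t n = STARPU_VECTOR_GET_NX(buffers[0]);
        for (std::uint32_t i = 0; i < n; ++i) v[i] *= 2.0f;
    }

    int main() {
        starpu_init(nullptr);

        // Register data with the runtime; StarPU then manages all transfers.
        float data[1024] = {0};
        starpu_data_handle_t h;
        starpu_vector_data_register(&h, STARPU_MAIN_RAM,
                                    reinterpret_cast<std::uintptr_t>(data),
                                    1024, sizeof(float));

        // A codelet bundles the implementation variants of one computation.
        starpu_codelet cl = {};
        cl.cpu_funcs[0] = scale_cpu;
        cl.nbuffers = 1;
        cl.modes[0] = STARPU_RW;

        // The runtime schedules the task and picks a variant dynamically.
        starpu_task_insert(&cl, STARPU_RW, h, 0);

        // Unregistering waits for the task and writes data back to main memory.
        starpu_data_unregister(h);
        starpu_shutdown();
        return 0;
    }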