Nowadays, it is possible to build a multi-GPU supercomputer, well suited to the implementation of digital signal processing algorithms, for a few thousand dollars. However, to achieve the highest performance on this kind of architecture, the programmer has to focus on inter-processor communications and task synchronization. In this paper, we propose a high-level programming model based on a data flow graph (DFG) that allows an efficient implementation of digital signal processing applications on a multi-GPU computer cluster. This DFG-based design flow abstracts the underlying architecture. We focus in particular on the efficient implementation of communications by automating computation-communication overlap, which can lead to significant speedups, as shown in the presented benchmark. The approach is validated on three experiments: a multi-host multi-GPU benchmark, a 3D granulometry application developed for research on materials, and an application for computing visual saliency maps.
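The abstract gives no code; as a minimal sketch of the kind of computation-communication overlap the DFG runtime is said to automate, one could imagine chunked transfers and kernels issued on separate CUDA streams (the buffer names, sizes, and the `process_chunk` kernel below are hypothetical, not taken from the paper):

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel standing in for one DFG node's computation.
__global__ void process_chunk(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // placeholder computation
}

int main() {
    const int kChunks = 4, kChunkLen = 1 << 20;
    float *h_in, *h_out, *d_in, *d_out;
    // Pinned host memory is required for truly asynchronous copies.
    cudaMallocHost((void**)&h_in,  kChunks * kChunkLen * sizeof(float));
    cudaMallocHost((void**)&h_out, kChunks * kChunkLen * sizeof(float));
    cudaMalloc((void**)&d_in,  kChunks * kChunkLen * sizeof(float));
    cudaMalloc((void**)&d_out, kChunks * kChunkLen * sizeof(float));

    // One stream per in-flight chunk, so the copy of chunk k+1 can
    // overlap with the kernel processing chunk k.
    cudaStream_t streams[kChunks];
    for (int c = 0; c < kChunks; ++c) cudaStreamCreate(&streams[c]);

    for (int c = 0; c < kChunks; ++c) {
        size_t off = (size_t)c * kChunkLen;
        cudaMemcpyAsync(d_in + off, h_in + off, kChunkLen * sizeof(float),
                        cudaMemcpyHostToDevice, streams[c]);
        process_chunk<<<(kChunkLen + 255) / 256, 256, 0, streams[c]>>>(
            d_in + off, d_out + off, kChunkLen);
        cudaMemcpyAsync(h_out + off, d_out + off, kChunkLen * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[c]);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < kChunks; ++c) cudaStreamDestroy(streams[c]);
    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}
```

In the paper this scheduling is derived automatically from the DFG rather than written by hand as above.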
High-performance computing with low-cost machines has become a reality with GPUs. Unfortunately, high performance is achieved only when the programmer exploits the architectural specificities of GPU processors: inter-GPU communications, task allocation among the GPUs, task scheduling, external memory prefetching, and synchronization. In this paper, we propose and evaluate a compile flow. It automates the transformation of a program expressed in the high-level system design language SystemC into its implementation on a multi-GPU cluster. SystemC constructs and the scheduler are directly mapped to the GPU API, preserving their semantics. Inter-GPU communications are abstracted by means of SystemC channels.
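To make the channel abstraction concrete, the following illustrative sketch (module, port, and value names are ours, not the paper's) shows a SystemC producer/consumer pair connected by an sc_fifo; in the flow described, such a channel would presumably be lowered by the compiler to asynchronous GPU transfers:

```cpp
#include <systemc.h>
#include <iostream>

// Illustrative producer/consumer; the sc_fifo between them is the kind of
// channel that the described compile flow would map to inter-GPU copies.
SC_MODULE(Producer) {
    sc_fifo_out<float> out;
    SC_CTOR(Producer) { SC_THREAD(run); }
    void run() {
        for (int i = 0; i < 8; ++i)
            out.write(static_cast<float>(i));  // blocking write: channel semantics
    }
};

SC_MODULE(Consumer) {
    sc_fifo_in<float> in;
    SC_CTOR(Consumer) { SC_THREAD(run); }
    void run() {
        for (int i = 0; i < 8; ++i)
            std::cout << "got " << in.read() << std::endl;  // blocking read
    }
};

int sc_main(int, char*[]) {
    sc_fifo<float> ch(4);       // bounded channel; depth maps to a buffer size
    Producer p("p");
    Consumer c("c");
    p.out(ch);
    c.in(ch);
    sc_start();
    return 0;
}
```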
Digital signal processing (DSP) applications are among the biggest consumers of computing power. They process large data volumes represented with high accuracy, use complex algorithms, and in most cases must satisfy time constraints. On the other hand, it is now necessary to use parallel and heterogeneous architectures to speed up processing; the best examples are the supercomputers "Tianhe-2" and "Titan" from the TOP500 ranking. These architectures may contain several connected nodes, each node including a number of general-purpose processors (multi-core) and a number of accelerators (many-core), which together offer several levels of parallelism. However, it is still complicated for DSP programmers to exploit all these parallelism levels and reach good performance for their applications. They have to design their implementation to take advantage of all the heterogeneous computing units, taking into account the architectural specificities of each of them: communication model, memory management, data management, job scheduling, synchronization, etc. In the present work, we characterize DSP applications and, based on their distinctive features, we propose a high-level visual programming model and an execution model that simplify their implementation while delivering the desired performance.
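As a rough illustration of combining two of the parallelism levels mentioned above, one host thread per GPU can drive device-level parallelism; this sketch is ours (the `scale` kernel and sizes are placeholders), not the model proposed in the paper:

```cpp
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Placeholder kernel standing in for one DSP stage.
__global__ void scale(float* buf, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= k;
}

// One host thread per GPU: thread-level (multi-core) parallelism combined
// with device-level (many-core) parallelism.
void worker(int dev, int n) {
    cudaSetDevice(dev);
    float* d;
    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d);
}

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    std::vector<std::thread> pool;
    for (int d = 0; d < ndev; ++d) pool.emplace_back(worker, d, 1 << 20);
    for (auto& t : pool) t.join();
    return 0;
}
```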
The most popular formulation of Moore's law, which states that the number of transistors on integrated circuits doubles every 18 months, is said to hold for at least another two decades. According to this prediction, if we want to take advantage of technological evolutions, designers' productivity has to increase in the same proportions. To take up this challenge, system-level design solutions have been set up, but much effort is still needed on system modelling and synthesis. In this paper we propose a computation core synthesis methodology that can be integrated into the communication refinement steps of electronic system level design tools. In the proposed approach, computation cores used for digital signal processing application specifications relying on coarse-grain communications and synchronizations (e.g. matrices) can be refined into computation cores which handle fine-grain communications and synchronizations (e.g. scalars). Its originality is its ability to synthesize computation cores that handle fine-grain data consumptions and productions respecting the intrinsic partial orders of the algorithms while preserving their original functionality. Such cores can be used to model fine-grain input/output overlapping or iteration pipelining. Our flow is based on the analysis of a fine-grain signal flow graph used to extract fine-grain synchronizations and algorithmic expressions.
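Purely to illustrate the coarse-grain versus fine-grain distinction (the function names and the toy element-wise computation are ours, not the paper's synthesis flow), a coarse-grain core consumes a whole matrix before producing anything, whereas a refined core consumes and produces scalars in the algorithm's partial order, so downstream I/O can overlap computation:

```cpp
#include <cstdio>
#include <vector>

// Coarse-grain core: all inputs are read before any output is produced.
std::vector<float> core_coarse(const std::vector<float>& a) {
    std::vector<float> r(a.size());
    for (size_t i = 0; i < a.size(); ++i) r[i] = a[i] + 1.0f;
    return r;
}

// Fine-grain core: per-scalar consumption/production through callbacks,
// so each output can be forwarded as soon as its inputs are available
// (here output i depends only on input i).
template <class Pull, class Push>
void core_fine(size_t n, Pull pull, Push push) {
    for (size_t i = 0; i < n; ++i)
        push(pull() + 1.0f);
}

int main() {
    std::vector<float> in = {1.f, 2.f, 3.f, 4.f};
    std::vector<float> out = core_coarse(in);
    std::printf("coarse first output: %g\n", out[0]);

    size_t k = 0;
    core_fine(in.size(),
              [&]() { return in[k]; },                       // fine-grain input
              [&](float v) { std::printf("%g\n", v); ++k; }); // fine-grain output
    return 0;
}
```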