Nowadays, it is possible to build a multi-GPU supercomputer, well suited to the implementation of digital signal processing algorithms, for a few thousand dollars. However, to achieve the highest performance on this kind of architecture, the programmer has to focus on inter-processor communications and task synchronization. In this paper, we propose a high-level programming model based on a data flow graph (DFG) that allows an efficient implementation of digital signal processing applications on a multi-GPU computer cluster. This DFG-based design flow abstracts the underlying architecture. We focus in particular on the efficient implementation of communications by automating computation-communication overlap, which can lead to significant speedups, as shown in the presented benchmark. The approach is validated on three experiments: a multi-host multi-GPU benchmark, a 3D granulometry application developed for research on materials, and an application for computing visual saliency maps.
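The abstract gives no code; as a minimal sketch of the kind of computation-communication overlap the DFG runtime is said to automate, one could imagine chunked transfers and kernels issued on separate CUDA streams (the buffer names, sizes, and the `process_chunk` kernel below are hypothetical, not taken from the paper):

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel standing in for one DFG node's computation.
__global__ void process_chunk(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // placeholder computation
}

int main() {
    const int kChunks = 4, kChunkLen = 1 << 20;
    float *h_in, *h_out, *d_in, *d_out;
    // Pinned host memory is required for truly asynchronous copies.
    cudaMallocHost((void**)&h_in,  kChunks * kChunkLen * sizeof(float));
    cudaMallocHost((void**)&h_out, kChunks * kChunkLen * sizeof(float));
    cudaMalloc((void**)&d_in,  kChunks * kChunkLen * sizeof(float));
    cudaMalloc((void**)&d_out, kChunks * kChunkLen * sizeof(float));

    // One stream per in-flight chunk, so the copy of chunk k+1 can
    // overlap with the kernel processing chunk k.
    cudaStream_t streams[kChunks];
    for (int c = 0; c < kChunks; ++c) cudaStreamCreate(&streams[c]);

    for (int c = 0; c < kChunks; ++c) {
        size_t off = (size_t)c * kChunkLen;
        cudaMemcpyAsync(d_in + off, h_in + off, kChunkLen * sizeof(float),
                        cudaMemcpyHostToDevice, streams[c]);
        process_chunk<<<(kChunkLen + 255) / 256, 256, 0, streams[c]>>>(
            d_in + off, d_out + off, kChunkLen);
        cudaMemcpyAsync(h_out + off, d_out + off, kChunkLen * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[c]);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < kChunks; ++c) cudaStreamDestroy(streams[c]);
    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}
```

In the paper this scheduling is derived automatically from the DFG rather than written by hand as above.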
High-performance computing with low-cost machines has become a reality with GPUs. Unfortunately, high performance is achieved only when the programmer exploits the architectural specificities of GPU processors: inter-GPU communications, task allocation among the GPUs, task scheduling, external memory prefetching, and synchronization. In this paper, we propose and evaluate a compile flow. It automates the transformation of a program expressed in the high-level system design language SystemC into its implementation on a multi-GPU cluster. SystemC constructs and the scheduler are directly mapped to the GPU API, preserving their semantics. Inter-GPU communications are abstracted by means of SystemC channels.
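To make the channel abstraction concrete, the following illustrative sketch (module, port, and value names are ours, not the paper's) shows a SystemC producer/consumer pair connected by an sc_fifo; in the flow described, such a channel would presumably be lowered by the compiler to asynchronous GPU transfers:

```cpp
#include <systemc.h>
#include <iostream>

// Illustrative producer/consumer; the sc_fifo between them is the kind of
// channel that the described compile flow would map to inter-GPU copies.
SC_MODULE(Producer) {
    sc_fifo_out<float> out;
    SC_CTOR(Producer) { SC_THREAD(run); }
    void run() {
        for (int i = 0; i < 8; ++i)
            out.write(static_cast<float>(i));  // blocking write: channel semantics
    }
};

SC_MODULE(Consumer) {
    sc_fifo_in<float> in;
    SC_CTOR(Consumer) { SC_THREAD(run); }
    void run() {
        for (int i = 0; i < 8; ++i)
            std::cout << "got " << in.read() << std::endl;  // blocking read
    }
};

int sc_main(int, char*[]) {
    sc_fifo<float> ch(4);       // bounded channel; depth maps to a buffer size
    Producer p("p");
    Consumer c("c");
    p.out(ch);
    c.in(ch);
    sc_start();
    return 0;
}
```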
Digital signal processing (DSP) applications are among the biggest consumers of computing power. They process large data volumes represented with high accuracy, use complex algorithms, and in most cases must satisfy time constraints. On the other hand, it is now necessary to use parallel and heterogeneous architectures to speed up processing; the best examples are the supercomputers "Tianhe-2" and "Titan" from the TOP500 ranking. These architectures may contain several connected nodes, each node including a number of general-purpose processors (multi-core) and a number of accelerators (many-core), which together offer several levels of parallelism. However, it is still complicated for DSP programmers to exploit all these parallelism levels and reach good performance for their applications. They have to design their implementation to take advantage of all the heterogeneous computing units, taking into account the architectural specificities of each of them: communication model, memory management, data management, job scheduling, synchronization, etc. In the present work, we characterize DSP applications and, based on their distinctive features, we propose a high-level visual programming model and an execution model that simplify their implementation while delivering the desired performance.
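As a rough illustration of combining two of the parallelism levels mentioned above, one host thread per GPU can drive device-level parallelism; this sketch is ours (the `scale` kernel and sizes are placeholders), not the model proposed in the paper:

```cpp
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Placeholder kernel standing in for one DSP stage.
__global__ void scale(float* buf, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= k;
}

// One host thread per GPU: thread-level (multi-core) parallelism combined
// with device-level (many-core) parallelism.
void worker(int dev, int n) {
    cudaSetDevice(dev);
    float* d;
    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d);
}

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    std::vector<std::thread> pool;
    for (int d = 0; d < ndev; ++d) pool.emplace_back(worker, d, 1 << 20);
    for (auto& t : pool) t.join();
    return 0;
}
```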
The most popular formulation of Moore's law, which states that the number of transistors on integrated circuits doubles every 18 months, is said to hold for at least another two decades. According to this prediction, if we want to take advantage of technological evolutions, designers' productivity has to increase in the same proportions. To take up this challenge, system-level design solutions have been set up, but much effort is still needed on system modelling and synthesis. In this paper we propose a computation core synthesis methodology that can be integrated into the communication refinement steps of electronic system level design tools. In the proposed approach, computation cores used for digital signal processing application specifications relying on coarse-grain communications and synchronizations (e.g. matrices) can be refined into computation cores which handle fine-grain communications and synchronizations (e.g. scalars). Its originality is its ability to synthesize computation cores that handle fine-grain data consumptions and productions respecting the intrinsic partial orders of the algorithms while preserving their original functionality. Such cores can be used to model fine-grain input/output overlapping or iteration pipelining. Our flow is based on the analysis of a fine-grain signal flow graph used to extract fine-grain synchronizations and algorithmic expressions.
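Purely to illustrate the coarse-grain versus fine-grain distinction (the function names and the toy element-wise computation are ours, not the paper's synthesis flow), a coarse-grain core consumes a whole matrix before producing anything, whereas a refined core consumes and produces scalars in the algorithm's partial order, so downstream I/O can overlap computation:

```cpp
#include <cstdio>
#include <vector>

// Coarse-grain core: all inputs are read before any output is produced.
std::vector<float> core_coarse(const std::vector<float>& a) {
    std::vector<float> r(a.size());
    for (size_t i = 0; i < a.size(); ++i) r[i] = a[i] + 1.0f;
    return r;
}

// Fine-grain core: per-scalar consumption/production through callbacks,
// so each output can be forwarded as soon as its inputs are available
// (here output i depends only on input i).
template <class Pull, class Push>
void core_fine(size_t n, Pull pull, Push push) {
    for (size_t i = 0; i < n; ++i)
        push(pull() + 1.0f);
}

int main() {
    std::vector<float> in = {1.f, 2.f, 3.f, 4.f};
    std::vector<float> out = core_coarse(in);
    std::printf("coarse first output: %g\n", out[0]);

    size_t k = 0;
    core_fine(in.size(),
              [&]() { return in[k]; },                       // fine-grain input
              [&](float v) { std::printf("%g\n", v); ++k; }); // fine-grain output
    return 0;
}
```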