Parallelizing the Chambolle Algorithm for Performance-Optimized Mapping on FPGA Devices

Beretta, Ivan; Rana, Vincenzo; Akın, Abdulkadir; Nacci, Alessandro Antonio; Sciuto, D.; Atienza, David

doi:10.1145/2851497

Cited by 5 publications

(3 citation statements)

References 27 publications


“…Our key insight is that the impact of a single input element can be computed by repeatedly applying the original stencil to an impulse signal. Concretely, the impulse array has the value one at its middle position and zeros at all other positions. The length of the impulse array depends on the radius of the stencil pattern and the number of iterations to harvest parallelism from.…”

Section: Computing Dce Coefficients (mentioning)

confidence: 99%
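
The impulse-response idea in the statement above can be sketched as follows. This is a minimal illustration, assuming a linear 1D stencil with constant coefficients and zero padding at the borders; the function names `apply_stencil` and `impulse_response` and the example weights are hypothetical and are not taken from the cited paper.

```python
def apply_stencil(signal, weights):
    """Apply one iteration of a linear 1D stencil (zero padding at the borders).

    `weights` has odd length 2*r + 1, where r is the stencil radius, and
    weights[k] multiplies the neighbour at offset k - r.
    """
    r = len(weights) // 2
    n = len(signal)
    out = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(weights):
            j = i + k - r
            if 0 <= j < n:
                acc += w * signal[j]
        out[i] = acc
    return out


def impulse_response(weights, iterations):
    """Return the response to a unit impulse after `iterations` stencil steps.

    Entry i is the contribution of the middle input element to output i.
    For an asymmetric stencil, the gather-form coefficients are this list
    reversed; for a symmetric stencil the two coincide.
    """
    r = len(weights) // 2
    # The impulse spreads by r positions per iteration, so this length
    # is just large enough to hold the full response.
    length = 2 * r * iterations + 1
    impulse = [0.0] * length
    impulse[length // 2] = 1.0  # one in the middle, zeros elsewhere
    for _ in range(iterations):
        impulse = apply_stencil(impulse, weights)
    return impulse
```

For example, `impulse_response([0.25, 0.5, 0.25], 2)` yields a five-element array, matching the stated dependence of the impulse length on the stencil radius and the number of iterations.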

“…The cone-based ISL acceleration approach is the most similar approach to DCMI described in the literature. We have chosen CA [2,29,40] as the representative of this class as they share our goal of providing an automatic design flow. Zohouri et al [60] takes an approach that is architecturally similar to CA but realized through OpenCL.…”

Section: FPGA-based ISL Accelerators (mentioning)

confidence: 99%


DCMI

Koraei; Fatemi; Jahre (2019). ACM Trans. Archit. Code Optim.


Iterative Stencil Loops (ISLs) are the key kernel within a range of compute-intensive applications. To accelerate ISLs with Field Programmable Gate Arrays, it is critical to exploit parallelism (1) among elements within the same iteration and (2) across loop iterations. We propose a novel ISL acceleration scheme called Direct Computation of Multiple Iterations (DCMI) that improves upon prior work by pre-computing the effective stencil coefficients after a number of iterations at design time, resulting in accelerators that use minimal on-chip memory and avoid redundant computation. This enables DCMI to improve throughput by up to 7.7× compared to the state-of-the-art cone-based architecture.

CCS Concepts: • Computer systems organization → Architectures; • Computing methodologies → Parallel computing methodologies;
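
The effect of pre-computing effective coefficients can be illustrated with a small sketch. It assumes a linear 1D stencil with constant coefficients, and the equivalence it shows holds only for interior points that are far enough from the borders. All names here (`stencil_step`, `compose`, `iterate`, `direct`) are hypothetical and do not reflect the DCMI implementation or its hardware architecture.

```python
def stencil_step(values, weights):
    """One iteration of a linear 1D stencil; out-of-range neighbours count as zero."""
    r = len(weights) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(weights):
            j = i + k - r
            if 0 <= j < n:
                acc += w * values[j]
        out.append(acc)
    return out


def compose(a, b):
    """Coefficients of applying stencil `a` and then stencil `b` (discrete convolution)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def iterate(values, weights, iterations):
    """Baseline: apply the original stencil once per iteration."""
    for _ in range(iterations):
        values = stencil_step(values, weights)
    return values


def direct(values, weights, iterations):
    """Compute the result of several iterations in a single pass.

    The effective coefficients are built once (the analogue of a design-time
    step) and then applied in one sweep over the input.
    """
    effective = list(weights)
    for _ in range(iterations - 1):
        effective = compose(effective, weights)
    return stencil_step(values, effective)
```

With radius `r` and `T` iterations, `iterate` and `direct` agree on every point at least `r * T` positions away from either border; points nearer the border differ because the intermediate iterations see the boundary.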


“…Different implementations exist on CPU namely the original [10] and improved [11] versions, a parallel OpenMP version [18] and a SIMD version [19]. FPGA implementations have also been developed [20] and have been optimised in memory allocation and power consumption [21]. Finally, GPU implementations have been developed for the original [10], improved [11] and further optimised TV-L1 versions [22], [23].…”

Section: Introduction (mentioning)

confidence: 99%

Implementations Impact on Iterative Image Processing for Embedded GPU

Romera; Petreto; Lemaître; et al. (2021). 2021 29th European Signal Processing Conference (EUSIPCO)


The emergence of low-power embedded Graphical Processing Units (GPUs) with high computation capabilities has enabled the integration of image processing chains in a wide variety of embedded systems. Various optimisation techniques are, however, needed in order to get the most out of an embedded GPU. This paper explores several optimisation methods for iterative stencil-like image processing algorithms on embedded NVIDIA GPUs using the Compute Unified Device Architecture (CUDA) API. We chose to focus our architectural optimisations on the TV-L1 algorithm, an optical flow estimation method based on total variation (TV) regularisation and the L1 norm. It is widely used as a model for more complex optical flow estimations and is used in many recent video processing applications. In this work we evaluate the impact of architecture-oriented optimisations on both execution time and energy consumption on several NVIDIA Jetson embedded GPU boards. Results show a speedup of up to 3× compared to state-of-the-art versions as well as a 2.6× decrease in energy consumption.
