Optimizing memory bandwidth exploitation for OpenVX applications on embedded many-core accelerators

Tagliavini, Giuseppe; Haugou, Germain; Marongiu, Andrea; Benini, Luca

doi:10.1007/s11554-015-0544-0

Cited by 12 publications

(6 citation statements)

References 37 publications


“…There are no publicly available benchmarks for OpenVX. In the literature, previous work on optimizing OpenVX graphs uses a set of relatively small graphs to evaluate the proposed techniques [14] [15], making it difficult to generalize results. In order to evaluate our approach over a large set of OpenVX graphs, we developed a tool to randomly generate synthetic graphs having a given number of kernels.…”

Section: Automated Kernel Fusing and Tile Size Selection (mentioning)

confidence: 99%

Optimizing OpenVX Graphs for Data Movement

Abeysinghe, Villarreal, Bakos

2024

Preprint



“…In the OpenCL abstract model, each instance of the execution kernel is called a work-item, which is represented by its coordinates in the NDRange. The corresponding hardware is the processing element [34]. Multiple work-items are organized as a work-group, providing a coarser division of NDRange, where work-items in a given work-group are executed concurrently on the processing element of a compute unit.…”

Section: A OpenCL Parallel Computing Platform (mentioning)

confidence: 99%
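The NDRange decomposition described in the statement above can be illustrated with a minimal sketch. It is plain Python rather than OpenCL, the helper names `num_groups` and `decompose` are hypothetical, and it assumes the global size is an exact multiple of the work-group size in every dimension (a simplification; it is not taken from the cited paper).

```python
def num_groups(global_size, local_size):
    """Number of work-groups per dimension of the NDRange.

    Assumption: each global dimension is an exact multiple of the
    corresponding work-group (local) dimension.
    """
    if len(global_size) != len(local_size):
        raise ValueError("global_size and local_size must have the same rank")
    for g, l in zip(global_size, local_size):
        if l <= 0 or g % l != 0:
            raise ValueError("global size must be a multiple of local size")
    return tuple(g // l for g, l in zip(global_size, local_size))


def decompose(global_id, local_size):
    """Split a work-item's global coordinates into (group_id, local_id).

    group_id identifies the work-group (the coarser division of the
    NDRange); local_id is the work-item's position inside that group.
    """
    group_id = tuple(g // l for g, l in zip(global_id, local_size))
    local_id = tuple(g % l for g, l in zip(global_id, local_size))
    return group_id, local_id


def recompose(group_id, local_id, local_size):
    """Inverse of decompose: rebuild the global coordinates."""
    return tuple(gr * l + lo for gr, lo, l in zip(group_id, local_id, local_size))
```

For example, with 4x4 work-groups over an 8x16 NDRange, `num_groups((8, 16), (4, 4))` gives `(2, 4)`, and the work-item at global coordinates `(5, 9)` falls in group `(1, 2)` at local position `(1, 1)`.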

A Parallel Algorithm of Image Mean Filtering Based on OpenCL

et al. 2021


The image will be contaminated by noise during the imaging process, which severely degrades the image quality. It is necessary to filter the collected image. With the increasing amount of image data, the traditional single-processor or multiprocessor computing equipment has been unable to meet the requirements of real-time data processing. In this paper, the computational model of weighted mean filtering and the characteristics of high performance computer architecture are studied. An efficient hierarchical image weighted mean filtering parallel algorithm for Open Computing Language (OpenCL) is designed and implemented, which can fully express the parallelism of the computing model. The parallel algorithm takes full account of the characteristics of image discrete convolution computing and the multilayer logic architecture of high performance computer, deeply excavates the parallelism of the computing platform and computing model, and realizes the efficient task mapping from computing model to computing resources. The model is implemented in parallel with the two levels of work-group and work-item. The experimental results show that compared with the serial algorithm based on CPU, the parallel algorithm based on Open Multi-Processing (OpenMP) and the parallel algorithm based on Compute Unified Device Architecture (CUDA), the parallel algorithm of weighted mean filtering achieves 20.88 times, 18.52 times and 1.26 times acceleration ratio on the NVIDIA GPU computing platform based on OpenCL architecture, respectively. It realizes better computing performance and runs on different Graphic Processing Unit (GPU) computing platforms, and has good portability and scalability.

Index terms: weighted mean filtering; Gaussian noise; Graphic Processing Unit (GPU); Open Computing Language (OpenCL); parallel algorithm.


“…It has been implemented by a few major vendors, including Nvidia, Intel, AMD, and Synopsys [28]. The authors of [5,9,25,31,32] focus on graph scheduling and design space exploration for heterogeneous systems consisting of GPUs, CPUs, and custom instruction-set architectures. Unlike the prior work, [24] suggests static OpenVX compilation for low-power embedded systems instead of runtime-library implementations.…”

Section: Related Work (mentioning)

confidence: 99%

“…3, where redundant computations are eliminated, and nodes are aggregated for better exploitation of locality. Memory access patterns of our abstractions entail system-level optimization strategies motivated by the OpenVX standard, such as image tiling [25] and hardware-software partitioning [26]. An abstraction-based implementation allows expressing aggregated computations as part of the reconstructed graph.…”

Section: Computational Abstractions (mentioning)

confidence: 99%

HipaccVX: wedding of OpenVX and DSL-based code generation

Ozkan, Qiao, et al. 2020

J Real-Time Image Proc


Writing programs for heterogeneous platforms optimized for high performance is hard since this requires the code to be tuned at a low level with architecture-specific optimizations that are most times based on fundamentally differing programming paradigms and languages. OpenVX promises to solve this issue for computer vision applications with a royalty-free industry standard that is based on a graph-execution model. Yet, OpenVX's algorithm space is constrained to a small set of vision functions. This hinders accelerating computations that are not included in the standard. In this paper, we analyze OpenVX vision functions to find an orthogonal set of computational abstractions. Based on these abstractions, we couple an existing domain-specific language (DSL) back end to the OpenVX environment and provide language constructs to the programmer for the definition of user-defined nodes. In this way, we enable optimizations that are not possible to detect with OpenVX graph implementations using the standard computer vision functions. These optimizations can double the throughput on an Nvidia GTX GPU and decrease the resource usage of a Xilinx Zynq FPGA by 50% for our benchmarks. Finally, we show that our proposed compiler framework, called HipaccVX, can achieve better results than the state-of-the-art approaches Nvidia VisionWorks and Halide-HLS.
