Programming Heterogeneous Systems from an Image Processing DSL

Pu, Jia; Bell, Susan Groag; Yang, Xuan; Setter, Jeff; Richardson, Stephen; Ragan-Kelley, Jonathan; Horowitz, Mark

doi:10.1145/3107953

Cited by 98 publications

(57 citation statements)

References 47 publications


“…HIPAcc [25], for example, uses a source-to-source compiler from a C-like front-end to generate CUDA, OpenCL, and Renderscript for targeting GPUs. Recent work on Halide [35] has demonstrated targeting heterogeneous systems, including the Xilinx Zynq's FPGA and ARM cores, by generating intermediate C++ and Vivado HLS [33]. Rigel [20] and Darkroom [19] generate Verilog, and PolyMage [14] generates OpenMP and C++ for high-level synthesis.…”

Section: Related Work (mentioning)

confidence: 99%

Spatial: a language and compiler for application accelerators

et al. 2018


Industry is increasingly turning to reconfigurable architectures like FPGAs and CGRAs for improved performance and energy efficiency. Unfortunately, adoption of these architectures has been limited by their programming models. HDLs lack abstractions for productivity and are difficult to target from higher level languages. HLS tools are more productive, but offer an ad-hoc mix of software and hardware abstractions which make performance optimizations difficult. In this work, we describe a new domain-specific language and compiler called Spatial for higher level descriptions of application accelerators. We describe Spatial's hardware-centric abstractions for both programmer productivity and design performance, and summarize the compiler passes required to support these abstractions, including pipeline scheduling, automatic memory banking, and automated design tuning driven by active machine learning. We demonstrate the language's ability to target FPGAs and CGRAs from common source code. We show that applications written in Spatial are, on average, 42% shorter and achieve a mean speedup of 2.9× over SDAccel HLS when targeting a Xilinx UltraScale+ VU9P FPGA on an Amazon EC2 F1 instance.


“…However, there is not much work in the deep learning literature in general, model inference in particular. Pu et al [32] extended the image processing language, Halide [33], to allow users to specify which part of their applications they want to execute on hardware accelerators (FPGA in their case). Similar to their technique, our approach also allows users to offload different portions of a CNN to different devices so that programmers can quickly build and tune new CNN models.…”

Section: Related Work (mentioning)

confidence: 99%

A Unified Optimization Approach for CNN Model Inference on Integrated GPUs

Wang, Chen, Liu, et al. 2019

Proceedings of the 48th International Conference on Parallel Processing


Modern deep learning applications urge to push the model inference taking place at the edge devices for multiple reasons such as achieving shorter latency, relieving the burden of the network connecting to the cloud, and protecting user privacy. The Convolutional Neural Network (CNN) is one of the most widely used model family in the applications. Given the high computational complexity of the CNN models, it is favorable to execute them on the integrated GPUs at the edge devices, which are ubiquitous and have more power and better energy efficiency than the accompanying CPUs. However, programming on integrated GPUs efficiently is challenging due to the variety of their architectures and programming interfaces. This paper proposes an end-to-end solution to execute CNN model inference on the integrated GPUs at the edge, which uses a unified IR to represent and optimize vision-specific operators on integrated GPUs from multiple vendors, as well as leverages machine learning-based scheduling search schemes to optimize computationally-intensive operators like convolution. Our solution even provides a fallback mechanism for operators not suitable or convenient to run on GPUs. The evaluation results suggest that compared to state-of-the-art solutions backed up by the vendor-provided high-performance libraries on Intel Graphics, ARM Mali GPU, and Nvidia integrated Maxwell GPU, our solution achieves similar, or even better (up to 1.62×), performance on a number of popular image classification and object detection models. In addition, our solution has a wider model coverage and is more flexible to embrace new models. Our solution has been adopted in production services in AWS and is open-sourced.


“…Because D-SWIM and Design1-2 have different throughputs, it was unfair to compare the resource number in Table 5 directly. Thus, we obtained the hardware efficiency of FPGA logic (LUT and REG) with Equation (5). Comparing with the highest-throughput design (Design1), the hardware efficiency of D-SWIM is 4.8× and 8.2× in LUT and REG, respectively.…”

Section: D-SWIM Buffer (mentioning)

confidence: 99%

High-Throughput Line Buffer Microarchitecture for Arbitrary Sized Streaming Image Processing

Shi, Wong, So 2019

J. Imaging


Parallel hardware designed for image processing promotes vision-guided intelligent applications. With the advantages of high-throughput and low-latency, streaming architecture on FPGA is especially attractive to real-time image processing. Notably, many real-world applications, such as region of interest (ROI) detection, demand the ability to process images continuously at different sizes and resolutions in hardware without interruptions. FPGA is especially suitable for implementation of such flexible streaming architecture, but most existing solutions require run-time reconfiguration, and hence cannot achieve seamless image size-switching. In this paper, we propose a dynamically-programmable buffer architecture (D-SWIM) based on the Stream-Windowing Interleaved Memory (SWIM) architecture to realize image processing on FPGA for image streams at arbitrary sizes defined at run time. D-SWIM redefines the way that on-chip memory is organized and controlled, and the hardware adapts to arbitrary image size with sub-100 ns delay that ensures minimum interruptions to the image processing at a high frame rate. Compared to the prior SWIM buffer for high-throughput scenarios, D-SWIM achieved dynamic programmability with only a slight overhead on logic resource usage, but saved up to 56% of the BRAM resource. The D-SWIM buffer achieves a max operating frequency of 329.5 MHz and reduction in power consumption by 45.7% comparing with the SWIM scheme. Real-world image processing applications, such as 2D-Convolution and the Harris Corner Detector, have also been used to evaluate D-SWIM’s performance, where a pixel throughput of 4.5 Giga Pixel/s and 4.2 Giga Pixel/s were achieved respectively in each case. Compared to the implementation with prior streaming frameworks, the D-SWIM-based design not only realizes seamless image size-switching, but also improves hardware efficiency up to 30×.
