An FPGA-optimized architecture of Horn and Schunck optical flow algorithm for real-time applications

Kunz, M.; Ostrowski, Alexander; Zipf, Peter

doi:10.1109/fpl.2014.6927406

Cited by 20 publications

(9 citation statements)

References 9 publications


“…CMLA's IPol website [4] also provides various codes of recent algorithms. Optimized implementations of optical flow algorithms were the subject of numerous works on FPGA [5], [6], [7], [8] and on GPU [9], [10], [11], [12], but few on CPU [11], [13]. It should also be noted that optical flow estimations based on machine learning are gaining in popularity in the scientific community [14], [15].…”

Section: Optical Flow Iterative Algorithms (mentioning)

confidence: 99%

Energy and Execution Time Comparison of Optical Flow Algorithms on SIMD and GPU Architectures

Petreto, Hennequin, Kœhler et al. 2018

2018 Conference on Design and Architectures for Signal and Image Processing (DASIP)


This article presents and compares optimized implementations of two optical flow algorithms on several target boards comprising multi-core SIMD processors and GPUs. The two algorithms are Horn-Schunck (HS) and TV-L1, chosen because both are well known and because they differ in computational complexity and accuracy. For both algorithms, we have made parallel optimized SIMD implementations, while HS has also been implemented on GPUs. For each algorithm, the comparison between the different versions and target boards is carried out along two dimensions: computing speed (in order to achieve real-time computation) and energy consumption, since we target embedded systems. The results show that for HS, the GPUs are the most efficient in both dimensions, able to process up to 8 Mpix images in real time (25 frames per second) for 0.35 J per image, against 1.8 Mpix images for 0.24 J per image on CPU. The results also highlight the impact of optimizations on TV-L1: far slower than HS without optimization, it can almost match HS performance after optimization on CPU, and can achieve real-time performance with 0.25 J for 1.4 Mpix images. We hope these results will help developers design optical flow embedded systems.
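The energy figures in this abstract can be put on a common footing with a little arithmetic. The sketch below is only an illustration: it assumes that each quoted energy-per-image value is sustained at the 25 frames per second the abstract gives as real time, and the function and dictionary names are hypothetical, not taken from the paper.

```python
# Figures quoted in the abstract: image size (Mpix) and energy per image (J).
# Assumption: each configuration runs at the stated real-time rate of 25 fps.
FPS = 25

reported = {
    "HS on GPU": {"mpix": 8.0, "joules_per_image": 0.35},
    "HS on CPU": {"mpix": 1.8, "joules_per_image": 0.24},
    "TV-L1 on CPU": {"mpix": 1.4, "joules_per_image": 0.25},
}


def average_power_watts(joules_per_image, fps=FPS):
    """Average power implied by an energy-per-frame figure at a given frame rate."""
    return joules_per_image * fps


def joules_per_mpix(joules_per_image, mpix):
    """Energy normalized by image size, so different resolutions can be compared."""
    return joules_per_image / mpix


def summarize(table):
    """Derive average power and per-megapixel energy for every entry."""
    return {
        name: {
            "watts": average_power_watts(row["joules_per_image"]),
            "joules_per_mpix": joules_per_mpix(row["joules_per_image"], row["mpix"]),
        }
        for name, row in table.items()
    }


if __name__ == "__main__":
    for name, derived in summarize(reported).items():
        print(f"{name}: {derived['watts']:.2f} W, "
              f"{derived['joules_per_mpix']:.3f} J/Mpix")
```

Normalizing by image size matters here because the GPU and CPU figures are quoted at different resolutions, so the per-image energies alone are not directly comparable.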



“…Approach vs. characteristics | non-tiled on-chip [6] | non-tiled off-chip [5], [9] | partial oblique tiling [10] | full oblique tiling (ours) | overlapped tiling [7] | Domain size scalability…”

Section: B Overview Of the Execution Datapath (mentioning)

confidence: 99%

“…The non-tiled variants [5], [6], [9] (recall Table I) may be viewed as designs with tile size in the time dimension set to 1, and the remaining dimensions equal to the problem size. These designs do not exploit temporal locality, and are not suited for iterative stencils.…”

Section: Comparison With Earlier Work (mentioning)

confidence: 99%
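The remark about temporal locality in the statement above can be illustrated with a toy model. The sketch below is not the design of any of the cited papers: it assumes a 1D three-point averaging stencil with fixed end cells, counts one "off-chip read" per cell loaded, and uses plain overlapped tiling (recomputing a halo per tile) as a simple stand-in for the tiling schemes being compared. All function names are hypothetical.

```python
def jacobi_step(a):
    """One 3-point averaging sweep; the first and last cells are held fixed."""
    n = len(a)
    if n < 3:
        return list(a)
    interior = [(a[i - 1] + a[i] + a[i + 1]) / 3.0 for i in range(1, n - 1)]
    return [a[0]] + interior + [a[-1]]


def non_tiled(a, steps):
    """Time-tile size 1: every iteration streams the whole domain again."""
    reads = 0
    for _ in range(steps):
        reads += len(a)  # the full domain is re-read for each sweep
        a = jacobi_step(a)
    return a, reads


def overlapped_tiled(a, steps, tile):
    """Load each tile once with a halo of `steps` cells, then run all sweeps locally."""
    n = len(a)
    out = [0.0] * n
    reads = 0
    for start in range(0, n, tile):
        end = min(start + tile, n)
        lo = max(0, start - steps)  # halo clipped at the domain boundary
        hi = min(n, end + steps)
        buf = a[lo:hi]
        reads += len(buf)  # one load per tile, halo included
        for _ in range(steps):
            # Stale values creep in one cell per sweep from artificial buffer
            # edges, so after `steps` sweeps only the halo is contaminated.
            buf = jacobi_step(buf)
        out[start:end] = buf[start - lo:end - lo]
    return out, reads


if __name__ == "__main__":
    data = [float((i * 7) % 11) for i in range(64)]
    ref, ref_reads = non_tiled(data, 4)
    res, tiled_reads = overlapped_tiled(data, 4, 16)
    print("same result:", all(abs(x - y) < 1e-12 for x, y in zip(ref, res)))
    print("reads without tiling:", ref_reads, "reads with tiling:", tiled_reads)
```

In this model the tiled version trades redundant halo computation for fewer loads, which is the kind of balance between on-chip work and off-chip bandwidth that the time dimension of a tile controls.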


One size does not fit all: Implementation trade-offs for iterative stencil computations on FPGAs

Deest, Yuki, Rajopadhye et al. 2017

2017 27th International Conference on Field Programmable Logic and Applications (FPL)


We generate a family of FPGA stencil accelerators targeting emerging System on Chip platforms (e.g., Xilinx Zynq or Intel SoC). Our designs come with design knobs to explore trade-offs. We also propose performance models to hone in on the most interesting design points, and show how they accurately lead to optimal designs. The optimal choice depends on problem sizes and performance goals.

I. INTRODUCTION

Iterative stencil computations arise in many application domains, ranging from medical imaging to numerical simulation. Since they are computationally demanding, a large body of work addressed the problem of parallelizing and optimizing stencils for multi-cores, GPUs, and FPGAs.

Earlier attempts targeting FPGAs showed that the performance of such accelerators is a complex interplay between the raw FPGA computing power, the amount of on-chip memory, and the performance of the external memory system [1]-[8]. They also illustrate different application requirements. For example, in the context of embedded vision, designers often seek the cheapest design achieving real-time performance constraints (e.g., 4K@60fps). In an exascale context, they may want to maximize performance (measured in ops-per-second) for a given FPGA board, while maintaining power dissipation to a minimum. Therefore, we explore a family of design options that can accommodate a large set of constraints, by exposing trade-offs between computing power, bandwidth requirements, and FPGA resource usage. We focus on system-level issues. Our aim is not to provide hand-optimized FPGA implementations. We have developed a code generator that produces HLS-optimized C/C++ descriptions of accelerator instances, leaving low-level decisions to the HLS back-end.

Our designs build upon the tiling transformation, that we use to balance on-chip memory cost and off-chip bandwidth. The design space we explore can be characterized by the following design knobs.


“…Furthermore, the study [20] showed that pipelined image processing systems can achieve linear acceleration. Outstanding computing performances of parallel-pipelined modules for the Horn-Schunck optical flow algorithm were demonstrated in papers [31,35].…”

Section: Pipeline Data Processing (mentioning)

confidence: 99%

Real-time hardware–software embedded vision system for ITS smart camera implemented in Zynq SoC

Kryjak, Komorkiewicz, Gorgon 2016

J Real-Time Image Proc


The article demonstrates the usefulness of heterogeneous System on Chip (SoC) devices in smart cameras used in intelligent transportation systems (ITS). In a compact, energy-efficient system the following exemplary algorithms were implemented: vehicle queue length estimation, vehicle detection, vehicle counting and speed estimation (using multiple virtual detection lines), as well as vehicle type (local binary features and SVM classifier) and colour (k-means classifier and YCbCr colourspace analysis) recognition. The solution exploits the hardware-software architecture, i.e. the combination of reconfigurable resources and the efficient ARM processor. Most of the modules were implemented in hardware, using Verilog HDL, taking full advantage of the available parallelism and pipelining, which made real-time image processing possible. The ARM processor is responsible for executing some parts of the algorithm, i.e. high-level image processing and analysis, as well as for communication with external systems (e.g. traffic light controllers). The demonstrated results indicate that modern SoC systems are a very interesting platform for advanced ITS systems and other advanced embedded image processing, analysis and recognition applications.

Keywords: Intelligent Transportation Systems · Hardware-software image processing (Zynq SoC) · Vehicle queue length estimation · Vehicle detection · Vehicle type and colour recognition
