Abstract: Computer vision is one of the most active research fields in information technology today. Giving machines and robots the ability to see and comprehend the surrounding world at the speed of sight creates endless potential applications and opportunities. Feature detection and description algorithms can indeed be considered the retina of the eyes of such machines and robots. However, these algorithms are typically computationally intensive, which prevents them from achieving speed-of-sight real-time performance. In addition, they differ in their capabilities, and some work better on specific types of input than others. As such, it is essential to compactly report their pros and cons, their performance, and recent advances. This paper provides a comprehensive overview of the state of the art and recent advances in feature detection and description algorithms. Specifically, it starts by reviewing fundamental concepts; it then reports, compares, and discusses the performance and capabilities of these algorithms. The Maximally Stable Extremal Regions (MSER) and Scale Invariant Feature Transform (SIFT) algorithms, being two of the best of their kind, are selected, and their recent algorithmic derivatives are reviewed.
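To make the two selected algorithms concrete, the following is a minimal sketch of detecting MSER regions and SIFT keypoints using OpenCV's Python bindings; it assumes opencv-python >= 4.4 (where SIFT is part of the main module), and the image filename is a placeholder:

```python
# Minimal sketch: MSER region detection and SIFT keypoint/descriptor
# extraction with OpenCV. Assumes opencv-python >= 4.4 and a test image
# at the placeholder path "scene.png".
import cv2

gray = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

# MSER: extremal regions that stay stable over a range of intensity thresholds.
mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(gray)

# SIFT: scale- and rotation-invariant keypoints with 128-D descriptors.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

print(f"{len(regions)} MSER regions, {len(keypoints)} SIFT keypoints")
```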
Developing high-performance embedded vision applications requires balancing run-time performance with energy constraints. Given the mix of hardware accelerators that exist for embedded computer vision (e.g., multi-core CPUs, GPUs, and FPGAs) and their associated vendor-optimized vision libraries, it becomes a challenge for developers to navigate this fragmented solution space. To aid in determining which embedded platform is most suitable for their application, we conduct a comprehensive benchmark of the run-time performance and energy efficiency of a wide range of vision kernels. We discuss why a given underlying hardware architecture innately performs well or poorly based on the characteristics of a range of vision kernel categories. Specifically, our study covers three commonly used hardware accelerators for embedded vision applications: the ARM Cortex-A57 CPU, the Jetson TX2 GPU, and the ZCU102 FPGA, using their vendor-optimized vision libraries: OpenCV, VisionWorks, and xfOpenCV. Our results show that the GPU achieves an energy/frame reduction ratio of 1.1-3.2× over the other platforms for simple kernels, while for more complicated kernels and complete vision pipelines the FPGA outperforms the others with energy/frame reduction ratios of 1.2-22.3×. We also observe that the FPGA's advantage grows as a vision application's pipeline complexity increases.
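As an illustration of the energy/frame metric used above, here is a rough sketch in Python: the kernel timing is measured directly, but the power figure would come from a platform-specific sensor (e.g., the onboard power rails of a Jetson TX2); the value used here is purely hypothetical, and the kernel is an arbitrary OpenCV example:

```python
# Sketch of the energy/frame metric: energy per frame = average power
# multiplied by seconds per frame. The power value below is a hypothetical
# stand-in for a platform sensor reading.
import time
import cv2
import numpy as np

frame = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)  # synthetic 1080p frame

n_frames = 100
start = time.perf_counter()
for _ in range(n_frames):
    cv2.GaussianBlur(frame, (5, 5), 1.5)  # example "simple" vision kernel
elapsed = time.perf_counter() - start

avg_power_w = 7.5                          # hypothetical measured board power (watts)
s_per_frame = elapsed / n_frames
energy_per_frame_j = avg_power_w * s_per_frame
print(f"{s_per_frame * 1e3:.2f} ms/frame, {energy_per_frame_j * 1e3:.2f} mJ/frame")
```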