This paper presents a programmable, energy-efficient, real-time object detection accelerator based on deformable parts models (DPM), which provide 2x higher detection accuracy than traditional rigid-body models. With an 8-part deformable model, three methods are used to address the high computational complexity: classification pruning for 33x fewer part classifications, vector quantization for 15x memory size reduction, and feature basis projection for a 2x reduction in the cost of each classification. The chip is implemented in 65nm CMOS technology and processes HD (1920x1080) images at 30fps without any off-chip storage while consuming only 58.6mW (0.94nJ/pixel, 1168 GOPS/W). The chip has two classification engines that simultaneously detect two different object classes. With a measured peak throughput of 60fps, the classification engines can be time-multiplexed to detect more than two object classes. The design is energy scalable by changing the pruning factor or by disabling parts classification.

Keywords: DPM, object detection, basis projection, pruning.

Introduction

Object detection is critical to many embedded applications that require low power and real-time processing. For example, low latency and HD images are important for autonomous control, which must react quickly to fast-approaching objects, while low energy consumption is essential due to battery and heat limitations. Object detection involves not only classification/recognition but also localization, which is achieved by sliding a window of a pretrained model over the image. For multi-scale detection, the window slides over an image pyramid (multiple downscaled copies of the image); a software sketch of this flow is given below. Multi-scale detection is very challenging because the image pyramid results in a data expansion, which can exceed 100x for HD images. The high computational complexity of object detection therefore necessitates fast hardware implementations [1] to enable real-time processing.

This paper presents a complete object detection accelerator using DPM [2] with a root and an 8-part model, as shown in Fig. 1. DPM doubles the detection accuracy compared to rigid-template (root-only) detection. The 8 parts account for deformation, so that a single model can detect objects at different poses (Fig. 6) and increases detection confidence. However, this accuracy comes with a classification overhead of 35x more multiplications (DPM classification consumes 80% of a single detector's power), making multi-object detection a challenge. A software-based DPM object detector is described in [3]; it enables detection on 500x500 images at 30fps but requires a fully loaded 6-core Xeon processor and 32GB of memory. In this work, the classification overhead is significantly reduced by two main techniques: (1) classification pruning with vector quantization (VQ) for selective part processing, and (2) feature basis projection for sparse multiplications. Software sketches illustrating both techniques follow below.

Architecture Overview

Fig. 2 shows the block diagram of our detector architecture, including histogram of oriented gradients (HOG) feature pyramid generation ...
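As a point of reference for the data expansion discussed in the Introduction, the following is a minimal NumPy sketch of multi-scale sliding-window detection over an image pyramid. The scale factor, window size, stride, and `score_fn` are illustrative assumptions only and do not describe the chip's datapath.

```python
import numpy as np

def image_pyramid(image, scale=1.2, min_size=64):
    """Yield progressively downscaled copies of the image (nearest-neighbor for brevity)."""
    level = image
    while min(level.shape[:2]) >= min_size:
        yield level
        h, w = level.shape[:2]
        nh, nw = int(h / scale), int(w / scale)
        # Nearest-neighbor downscale; a real pipeline would low-pass filter first.
        rows = (np.arange(nh) * scale).astype(int)
        cols = (np.arange(nw) * scale).astype(int)
        level = level[rows][:, cols]

def sliding_window_detect(image, score_fn, win=(128, 64), stride=8, thresh=0.0):
    """Slide a fixed-size window over every pyramid level and keep high-scoring positions."""
    detections = []
    for level in image_pyramid(image):
        H, W = level.shape[:2]
        for y in range(0, H - win[0] + 1, stride):
            for x in range(0, W - win[1] + 1, stride):
                s = score_fn(level[y:y + win[0], x:x + win[1]])
                if s > thresh:
                    detections.append((y, x, level.shape[:2], s))
    return detections
```

Because every pyramid level is scanned densely, the total number of windows (and hence classifications) grows with the pyramid size, which is the source of the >100x data expansion for HD inputs.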
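Classification pruning can be pictured as follows: a cheap root-filter score is computed at every window, and the eight expensive part classifications are run only where that score survives a threshold. This is a simplified software reading of the technique; in the accelerator, pruning is coupled with VQ for selective part processing and memory reduction, so treat `prune_thresh` and the zero deformation costs below as placeholders.

```python
def dpm_score(root_score, part_scores, deformation_costs):
    """Combine root and part responses: parts add evidence minus a deformation penalty."""
    return root_score + sum(ps - dc for ps, dc in zip(part_scores, deformation_costs))

def detect_with_pruning(windows, root_fn, part_fns, prune_thresh):
    """Evaluate the cheap root filter everywhere; run the expensive part filters
    only at windows whose root score survives the pruning threshold."""
    results = []
    for w in windows:
        r = root_fn(w)              # cheap rigid-template (root-only) score
        if r < prune_thresh:        # pruned: the 8 part filters are never evaluated here
            continue
        part_scores = [f(w) for f in part_fns]
        deformation = [0.0] * len(part_fns)  # placeholder deformation costs
        results.append((w, dpm_score(r, part_scores, deformation)))
    return results
```

Raising `prune_thresh` trades detection coverage for fewer part classifications, which is the knob behind the energy scalability mentioned in the abstract.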
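Feature basis projection reduces the arithmetic per classification by expressing the classifier weights in a shared basis, so the projection of the feature vector onto that basis is computed once and each classifier applies only a sparse set of coefficients. The sketch below assumes a generic basis `B` and a simple magnitude threshold for sparsifying the coefficients; the actual basis and sparsity used on the chip are not specified here.

```python
import numpy as np

def project_weights(W, B, sparsity_thresh=1e-2):
    """Represent classifier weights W (n_classifiers x d) in a basis B (k x d): W ~= C @ B."""
    C = W @ np.linalg.pinv(B)                 # coefficients of each classifier in the basis
    C[np.abs(C) < sparsity_thresh] = 0.0      # drop small coefficients -> sparse multiplies
    return C

def score_with_basis(x, C, B):
    """Score = W @ x computed as C @ (B @ x); B @ x is shared across all classifiers/parts."""
    return C @ (B @ x)
```

Sharing the `B @ x` projection across the root and all part classifiers is what converts the per-classification cost into a small number of sparse multiplications.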