Image recognition technologies have gained prominence in a variety of fields, such as automotive and surveillance, with dedicated image-recognition ICs being developed recently [1][2]. Image-recognition ICs for advanced driver assistance systems (ADAS) have also been proposed [3]. However, future ADAS applications must support greater numbers of real-time recognition processes simultaneously, with higher detection rates and lower false-positive rates. For instance, adaptive cruise control (ACC), an application of ADAS, comprises many image recognition processes, such as pedestrian detection (PD), vehicle detection (VD), general obstacle detection (GOD), lane detection (LD), traffic light recognition (TLR), and traffic sign recognition (TSR). ACC also requires high detection accuracy to prevent unnecessary braking or acceleration. To satisfy these requirements, we have developed an SoC with two 4-core processor clusters and 14 hard-wired accelerators. It is designed to realize the six recognition processes (PD, VD, GOD, LD, TLR, and TSR) for ACC, as well as automatic high beam (AHB) for headlight control. It achieves 1.9TOPS peak performance at 3.37W. This low power consumption enables the SoC to operate with passive cooling in a high-temperature automotive environment.

Figure 18.2.1 shows a block diagram of the SoC. It comprises two 4-core processor clusters, 14 hard-wired accelerators of eight types, two RISC cores, 1.5MB of on-chip SRAM, two LPDDR2 interfaces (I/Fs), four video inputs, and two video outputs. Considering the product lifespan and high quality requirements of automotive systems, we selected LPDDR2 instead of LPDDR3. The two 64b LPDDR2 I/Fs provide 12.8GB/s of bandwidth. To improve flexibility, two low-power 4-core processor clusters [3] are integrated, based on a VLIW architecture with a SIMD instruction set. Each processor core can execute up to three slots of a VLIW instruction every cycle.
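As a quick sanity check on the headline figures above, the following Python sketch derives the power efficiency implied by 1.9TOPS at 3.37W and the per-pin transfer rate implied by 12.8GB/s over two 64b interfaces. The function names are illustrative, and the calculation assumes that 1GB/s means 10^9 bytes per second. The resulting transfer rate is an inference from the quoted numbers, not a figure stated in the text.

```python
# Figures quoted in the text.
PEAK_TOPS = 1.9              # peak performance, tera-operations per second
POWER_W = 3.37               # power consumption, watts
NUM_INTERFACES = 2           # number of LPDDR2 interfaces
BUS_WIDTH_BITS = 64          # data width of each interface
TOTAL_BANDWIDTH_GB_S = 12.8  # aggregate bandwidth, GB/s (assumed 1e9 bytes/s)


def power_efficiency_tops_per_watt(peak_tops, power_w):
    """Return peak operations per watt (TOPS/W)."""
    return peak_tops / power_w


def implied_transfer_rate_mtps(total_gb_per_s, num_interfaces, bus_width_bits):
    """Return the per-pin transfer rate (MT/s) implied by the total bandwidth."""
    # Bytes moved by all interfaces together in one transfer.
    bytes_per_transfer = num_interfaces * bus_width_bits / 8
    transfers_per_second = total_gb_per_s * 1e9 / bytes_per_transfer
    return transfers_per_second / 1e6


if __name__ == "__main__":
    efficiency = power_efficiency_tops_per_watt(PEAK_TOPS, POWER_W)
    rate = implied_transfer_rate_mtps(
        TOTAL_BANDWIDTH_GB_S, NUM_INTERFACES, BUS_WIDTH_BITS
    )
    print(f"Power efficiency: {efficiency:.3f} TOPS/W")
    print(f"Implied transfer rate: {rate:.0f} MT/s per pin")
```

Under these assumptions the efficiency comes to roughly 0.56TOPS/W, and the bandwidth figure corresponds to about 800MT/s per data pin.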
As floating point is becoming essential in image recognition, we have added a double-precision floating-point execution unit to each processor core. Each processor core can execute up to two SIMD integer instructions (where each instruction is 8-parallel 8b) or two double-precision floating-point instructions, such as addition and multiplication instructions. These instructions can be executed with an initiation interval of one cycle. The following eight types of hard-wired accelerators are included to enhance the performance of image recognition processes: 1) a CoHOG (co-occurrence histograms of oriented gradients) accelerator [3-4] for object classification; 2) a histogram accelerator for making histograms from images; 3) two 64-parallel SIMD filter accelerators for various filtering tasks, such as noise reduction and gradient-image generation; 4) three affine accelerators for image transformation and lens calibration tasks; 5) two matching accelerators for stereo matching and tracking tasks; 6) two pyramid accelerators for making pyramid images; 7) a structure-from-motion (SfM) accelerator; and 8) two accel...