Visual sensors combined with video-analysis algorithms can enhance applications in surveillance, healthcare, intelligent vehicle control, human-machine interfaces, etc. Several hardware solutions for video analysis exist. Analog on-sensor processing solutions [1] integrate processing with the image sensor, but the precision loss of analog signal processing prevents them from realizing complex algorithms, and they lack flexibility. Vision processors [2,3] achieve high GOPS figures by combining a processor array for parallel operations with a decision processor for the remaining ones; however, converting the parallel data of the processor array into scalar data for the decision processor creates a throughput bottleneck, and the parallel memory accesses lead to high power consumption. Privacy is also a critical issue when deploying visual sensors, because video data can be exposed by the image sensors or processors. This risk applies to all of the above solutions, since inputting or outputting video data is unavoidable.

iVisual is characterized as follows: 1) Privacy is protected by integrating a 2790fps CMOS image sensor (CIS), a 76.8GOPS vision processor and 1Mb of storage. iVisual is a light-in, answer-out SoC: no video data need to be revealed outside the chip. 2) A feature processor eliminates the throughput bottleneck and increases throughput by 36%. 3) A power efficiency of 205GOPS/W, 5× better than previous works [2,3], is achieved by introducing the feature processor, a gated-clock scheme and reduced memory accesses.

iVisual integrates the CIS, the bitplane memory and three processors: the GP, the feature processor (FP) and the decision processor (DP). The GP is a parallel-data-in, parallel-data-out processor and controls the bitplane memory. The FP is a parallel-data-in, scalar-out processor and therefore eliminates the throughput bottleneck of data conversion. The DP handles scalar-in, scalar-out operations, usually decisions that further control the program execution of the GP and FP.

The CIS is frame-pipelined with the GP, FP and DP to increase hardware utilization. The port of the bitplane memory is shared by the CIS and the GP, and port collisions are handled automatically. Sharing the port reduces SRAM area by 64% and die area by 16%, with an average collision probability below 0.1%.

The GP, FP and DP work concurrently. For each instruction, the availability of the required resources is checked, including resources in the other processors, and the instruction is executed only when all of them are available. This simple scheme keeps the inter-processor communication needed to synchronize the three processors to a minimum and increases throughput by 23% compared with tightly coupled processors [2]. The clocks of unused resources are turned off to reduce power.

The GP execution unit is a SIMD processor array with 128 processing elements (PEs). A PE cache between the PE array and the bitplane memory reduces memory accesses by 94%, saving 726mW of power, while the cache itself consumes 134mW. Various bitplane-memory access patterns and storage-allocation schemes are provided to reduce the program size and increase storage density. To enhance flexibility, each PE is indexed and has...
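
The port-sharing behavior described above can be pictured with a small sketch. The following C fragment is a hypothetical arbiter for the single bitplane-memory port shared by the CIS and the GP; the paper states only that collisions are handled automatically with below 0.1% average probability, so the fixed-priority policy here (sensor readout wins, the GP replays its access in the next cycle) is an assumption for illustration.

/* Hypothetical one-decision-per-cycle arbiter for the shared bitplane-memory
 * port. The priority policy is assumed, not taken from the iVisual design.   */
#include <stdbool.h>
#include <stdio.h>

typedef enum { GRANT_NONE, GRANT_CIS, GRANT_GP } Grant;

static Grant arbitrate(bool cis_req, bool gp_req, bool *gp_stall)
{
    *gp_stall = cis_req && gp_req;   /* collision: GP replays its access next cycle */
    if (cis_req) return GRANT_CIS;   /* sensor readout is never delayed             */
    if (gp_req)  return GRANT_GP;
    return GRANT_NONE;
}

int main(void)
{
    bool stall;
    Grant g = arbitrate(true, true, &stall);          /* the rare CIS/GP collision   */
    printf("grant=%d gp_stall=%d\n", (int)g, (int)stall);
    return 0;
}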
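
The loosely coupled issue rule can likewise be sketched behaviorally. In this hypothetical C model, each of the GP, FP and DP issues its next instruction only when every resource the instruction names, possibly in another processor, is free; otherwise it stalls for a cycle. The specific resource list and the clock-gating hook are illustrative assumptions, not the iVisual microarchitecture.

/* Behavioral sketch of the resource-availability issue rule (assumed resource
 * names; clock gating indicated only as a comment).                            */
#include <stdbool.h>
#include <stdio.h>

enum { RES_BITPLANE_PORT, RES_PE_ARRAY, RES_FP_ACCUM, RES_DP_FLAGS, NUM_RES };

static bool busy[NUM_RES];              /* set while a resource is held          */

/* Returns true and claims the resources if the instruction can issue now.      */
static bool try_issue(const char *instr, unsigned needs)
{
    for (int r = 0; r < NUM_RES; r++)
        if (((needs >> r) & 1u) && busy[r])
            return false;               /* stall: some required resource is busy */

    for (int r = 0; r < NUM_RES; r++)
        if ((needs >> r) & 1u) {
            busy[r] = true;
            /* clock_enable(r): only resources actually used are clocked         */
        }
    printf("issued: %s\n", instr);
    return true;
}

int main(void)
{
    /* A GP instruction holds the PE array and the bitplane-memory port.         */
    try_issue("GP: load bitplane into PE array",
              (1u << RES_PE_ARRAY) | (1u << RES_BITPLANE_PORT));

    /* An FP instruction that needs the PE array must wait until it is released. */
    if (!try_issue("FP: accumulate PE results",
                   (1u << RES_PE_ARRAY) | (1u << RES_FP_ACCUM)))
        printf("FP stalls this cycle (PE array busy)\n");

    busy[RES_PE_ARRAY] = busy[RES_BITPLANE_PORT] = false;    /* GP completes      */
    try_issue("FP: accumulate PE results",
              (1u << RES_PE_ARRAY) | (1u << RES_FP_ACCUM));
    return 0;
}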
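
Finally, the effect of the PE cache can be approximated with a toy model: a small direct-mapped row cache between the 128-PE array and the bitplane memory serves repeated accesses to the same row locally, so only misses reach the SRAM. The cache size, organization and access pattern below are assumptions; the paper reports only the resulting 94% access reduction and the associated power saving.

/* Toy direct-mapped row cache in front of the bitplane memory; sizes and the
 * access pattern are assumptions chosen only to show how hits filter SRAM
 * traffic.                                                                     */
#include <stdio.h>

#define CACHE_ROWS 8                     /* assumed number of cached rows        */

static int  tag[CACHE_ROWS];
static int  valid[CACHE_ROWS];
static long mem_accesses, total_accesses;

/* Fetch one bitplane row for the PE array; only misses touch the SRAM.         */
static void fetch_row(int row)
{
    int slot = row % CACHE_ROWS;
    total_accesses++;
    if (!valid[slot] || tag[slot] != row) {     /* miss: read bitplane memory    */
        mem_accesses++;
        tag[slot]   = row;
        valid[slot] = 1;
    }
}

int main(void)
{
    /* A window-based kernel revisits the same neighbouring rows many times.     */
    for (int pass = 0; pass < 10; pass++)
        for (int row = 0; row < 8; row++)
            fetch_row(row);
    printf("SRAM accesses avoided: %.0f%%\n",
           100.0 * (total_accesses - mem_accesses) / total_accesses);
    return 0;
}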