Mobile and edge devices have become common platforms for convolutional neural network (CNN) inference due to their superior privacy and service quality. To reduce the computational cost of convolution (CONV), recent CNN models adopt depth-wise CONV (DW-CONV) and Squeeze-and-Excitation (SE). However, existing area-efficient CNN accelerators are sub-optimal for these latest models because they were mainly optimized for compute-intensive standard CONV layers with abundant data reuse that can be pipelined with activation and normalization operations. In contrast, DW-CONV and SE are memory-intensive with limited data reuse. SE also strongly depends on its nearby CONV layers, making effective pipelining a daunting task. As a result, although DW-CONV and SE account for only 10% of the total operations, they become memory-bandwidth bound and spend more than 60% of the processing time on systolic-array-based accelerators.
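To make the compute- versus memory-intensive contrast concrete, the back-of-the-envelope sketch below compares the arithmetic intensity (MACs per byte of off-chip traffic) of a standard CONV layer and a DW-CONV layer. The layer shape, 8-bit data, and the assumption that each tensor crosses the memory interface exactly once are illustrative choices made here, not figures taken from the paper.

```python
# Illustrative arithmetic-intensity comparison (assumed layer shape, 1 byte/element,
# each tensor read/written from off-chip memory exactly once).

def conv_intensity(h, w, c_in, c_out, k, bytes_per_elem=1, depthwise=False):
    """Return (MACs, bytes moved, MACs per byte) for one CONV layer."""
    if depthwise:
        assert c_in == c_out
        macs = h * w * c_in * k * k          # one k x k filter per channel
        weights = k * k * c_in
    else:
        macs = h * w * c_out * k * k * c_in  # every output sums over all input channels
        weights = k * k * c_in * c_out
    elems = (h * w * c_in) + weights + (h * w * c_out)
    bytes_moved = elems * bytes_per_elem
    return macs, bytes_moved, macs / bytes_moved

std = conv_intensity(56, 56, 128, 128, 3)
dw  = conv_intensity(56, 56, 128, 128, 3, depthwise=True)
print(f"standard CONV : {std[2]:7.1f} MACs/byte")   # high data reuse -> compute-bound
print(f"depth-wise CONV: {dw[2]:7.1f} MACs/byte")   # little reuse    -> bandwidth-bound
```

Under these assumptions the standard CONV performs hundreds of MACs per byte moved, while the DW-CONV performs only a handful, which is why the latter saturates memory bandwidth long before it saturates the arithmetic units.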
We propose a CNN acceleration architecture called MVP, which efficiently processes both compute- and memory-intensive operations with a small area overhead on top of the baseline systolic-array-based architecture. We introduce a specialized vector unit tailored to DW-CONV, comprising multipliers, adder trees, and multi-banked buffers, to meet its high memory bandwidth requirement. We also augment the unified buffer with tiny processing elements to smoothly pipeline SE with the subsequent CONV, enabling concurrent processing of DW-CONV and standard CONV and thereby maximizing the utilization of the arithmetic units. Our evaluation shows that MVP improves performance by 2.6× and reduces energy by 47% on average for EfficientNet-B0/B4/B7, MnasNet, and MobileNet-V1/V2, with only a 9% area overhead compared to the baseline.
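For readers unfamiliar with SE, the minimal numpy sketch below shows the standard SE block (as defined in the original Squeeze-and-Excitation paper, not MVP's hardware mapping) and why it resists pipelining: the "squeeze" step needs the entire output of the preceding CONV before any channel-wise scaling can start. The shapes and reduction ratio are assumed for illustration.

```python
# Minimal SE block sketch (standard formulation; shapes and ratio r are assumptions).
import numpy as np

def se_block(x, w1, w2):
    """x: (H, W, C) output of the preceding CONV; w1: (C, C//r); w2: (C//r, C)."""
    s = x.mean(axis=(0, 1))                  # squeeze: global average pool -> (C,)
    e = np.maximum(s @ w1, 0.0)              # excitation: FC + ReLU -> (C//r,)
    gate = 1.0 / (1.0 + np.exp(-(e @ w2)))   # FC + sigmoid -> per-channel gate (C,)
    return x * gate                          # scale: recalibrate every channel

H, W, C, r = 14, 14, 64, 4
x  = np.random.rand(H, W, C).astype(np.float32)
w1 = np.random.rand(C, C // r).astype(np.float32)
w2 = np.random.rand(C // r, C).astype(np.float32)
y = se_block(x, w1, w2)   # few MACs, but the whole (H, W, C) map is read and rewritten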