Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions

Wu, Bichen; Wan, Alvin; Yue, Xiangyu; Jin, Peter H.; Zhao, Sicheng; Golmant, Noah; Gholaminejad, Amir; Keutzer, Kurt

doi:10.48550/arxiv.1711.08141

Cited by 27 publications (46 citation statements)

References 14 publications

“…SqueezeNext [10] uses a hardware simulator to adjust the macro-architecture of the network for better efficiency. ShiftNet [11] proposes a hardware-friendly shift operator to replace expensive spatial convolutions. AddressNet [21] designed three shift-based primitives to accelerate GPU inference.…”

Section: Background, 2.1 Efficient ConvNet Models (mentioning)

confidence: 99%

“…The motivation is that smaller convolution kernel sizes require less reuse of the feature map, resulting in simpler data movement schedule, control flow, and timing constraint. As pointed out by [11], ConvNets rely on spatial convolutions (3×3 convolutions and 3×3 depth-wise convolutions) to aggregate spatial information from neighboring pixels to the center position. However, spatial convolutions can be replaced by a more efficient operator called shift.…”

Section: DiracDeltaNet (mentioning)

confidence: 99%

“…In the downsample block, we directly replace the strided 3×3 depthwise convolutions with a stride-2 2×2 max-pooling. Unlike [11], our shift operation only uses 4 cardinal directions (up, down, left, right) in addition to the identity mapping (no-shift). This simplifies our hardware implementation of the shift operation without hurting accuracy.…”

Section: DiracDeltaNet (mentioning)

confidence: 99%
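The five-direction shift described in the statement above can be sketched as follows. This is a minimal NumPy illustration, not the citing authors' implementation: the function names, the (C, H, W) layout, the zero fill at the borders, and the round-robin assignment of directions to channels are all assumptions made for the example.

```python
import numpy as np

# Assumed (dy, dx) offsets: identity plus the four cardinal directions.
OFFSETS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]  # none, up, down, left, right


def shift_plane(plane, dy, dx):
    """Shift one (H, W) plane by (dy, dx); vacated cells are filled with zeros."""
    h, w = plane.shape
    out = np.zeros_like(plane)
    # Overlapping source and destination windows after the shift.
    src_y = slice(max(0, -dy), min(h, h - dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_y = slice(max(0, dy), min(h, h + dy))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = plane[src_y, src_x]
    return out


def shift(x, offsets=OFFSETS):
    """Shift each channel of x, shaped (C, H, W), by its assigned offset.

    Channel c uses offsets[c % len(offsets)] (hypothetical assignment rule).
    No multiplications and no learned parameters are involved.
    """
    out = np.empty_like(x)
    for c in range(x.shape[0]):
        dy, dx = offsets[c % len(offsets)]
        out[c] = shift_plane(x[c], dy, dx)
    return out
```

The operation only moves data, which is why it contributes no FLOPs or parameters; any mixing of information across channels has to come from a separate layer.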

“…With an inefficient model, an accelerator with high throughput in terms of GOPs can actually have low inference speed in terms of FPS, where FPS is the more essential metric of efficiency. To achieve AlexNet-level accuracy, SqueezeNet [9] is 50x smaller than AlexNet; SqueezeNext [10] is 112x smaller; ShiftNet-C [11], with 1.6% higher accuracy, is 77x smaller. However, not many designs target those efficient models.…”

Section: Introduction (mentioning)

confidence: 99%

“…Our co-design approach produces a novel ConvNet architecture DiracDeltaNet that is based on ShuffleNetV2 [13], one of the state-of-the-art efficient models with small model size, low FLOP counts, hardware friendly skip connections, and competitive accuracy. We optimize the network by replacing all 3×3 convolutions with shift operations [11] and 1×1 convolution, enabling us to implement a compute unit customized for 1×1 convolutions for better efficiency. The name "DiracDeltaNet" comes from the fact that the network only convolves input feature maps with 1×1 kernels.…”

Section: Introduction (mentioning)

confidence: 99%
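To make the "shift plus 1×1 convolution" replacement in the statement above concrete, here is a rough NumPy sketch. It is only an illustration under stated assumptions: `shift_channels`, `pointwise_conv`, and `shift_block` are hypothetical names, tensors are laid out as (C, H, W), borders are zero-filled, and offsets are limited to one pixel.

```python
import numpy as np


def shift_channels(x, offsets):
    """Shift channel c of x (C, H, W) by offsets[c % len(offsets)] = (dy, dx).

    Assumes |dy| <= 1 and |dx| <= 1; cells shifted in from outside are zero.
    """
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))  # one-pixel zero border
    out = np.empty_like(x)
    for ch in range(c):
        dy, dx = offsets[ch % len(offsets)]
        out[ch] = padded[ch, 1 - dy:1 - dy + h, 1 - dx:1 - dx + w]
    return out


def pointwise_conv(x, weight):
    """1x1 convolution: mix channels at each pixel; weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def shift_block(x, weight, offsets):
    """Spatial aggregation via shift, then channel mixing via 1x1 convolution."""
    return pointwise_conv(shift_channels(x, offsets), weight)
```

In this arrangement all arithmetic sits in the 1×1 convolution, which is consistent with the statement's point that a compute unit can be specialized for that single operator.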

Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs

Yang, Huang, et al. 2018

Preprint

Self Cite

Using FPGAs to accelerate ConvNets has attracted significant attention in recent years. However, FPGA accelerator design has not leveraged the latest progress of ConvNets. As a result, key application characteristics such as frames-per-second (FPS) are ignored in favor of simply counting GOPs, and results on accuracy, which is critical to application success, are often not even reported. In this work, we adopt an algorithm-hardware co-design approach to develop a ConvNet accelerator called Synetgy and a novel ConvNet model called DiracDeltaNet. Both the accelerator and ConvNet are tailored to FPGA requirements. DiracDeltaNet, as the name suggests, is a ConvNet with only 1×1 convolutions, while spatial convolutions are replaced by more efficient shift operations. DiracDeltaNet achieves competitive accuracy on ImageNet (88.7% top-5), but with 42× fewer parameters and 48× fewer OPs than VGG16. We further quantize DiracDeltaNet's weights and activations to 4 bits, with less than 1% accuracy loss. These quantizations exploit the nature of FPGA hardware well. In short, DiracDeltaNet's small model size, low computational OP count, low precision and simplified operators allow us to co-design a highly customized computing unit for an FPGA. We implement the computing units for DiracDeltaNet on an Ultra96 SoC system through high-level synthesis. Our accelerator's final top-5 accuracy of 88.1% on ImageNet is higher than that of all previously reported embedded FPGA accelerators. In addition, the accelerator reaches an inference speed of 96.5 FPS on the ImageNet classification task, surpassing prior works with similar accuracy by at least 16.9×.
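The abstract does not say how the 4-bit quantization is performed. Purely as an illustration of what storing values in 4 bits can mean, the sketch below uses symmetric uniform quantization with a single per-tensor scale; this scheme, and the names `quantize_symmetric` and `dequantize`, are assumptions for the example and may differ from the method used in the paper.

```python
import numpy as np


def quantize_symmetric(values, bits=4):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale.

    For bits=4, qmax is 7, so each value fits in a 4-bit signed code.
    Assumed scheme: symmetric, uniform, per-tensor scale.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(values)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = np.clip(np.round(values / scale), -qmax, qmax).astype(np.int8)
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return codes.astype(np.float32) * scale
```

With this scheme the rounding error of any in-range value is at most half of `scale`, so the accuracy impact depends on how widely the values are spread.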

Motion Feature Network: Fixed Motion Filter for Action Recognition

Lee, Son, et al. 2018

Lecture Notes in Computer Science

135

Spatio-temporal representations in frame sequences play an important role in the task of action recognition. Previously, a method of using optical flow as a temporal information in combination with a set of RGB images that contain spatial information has shown great performance enhancement in the action recognition tasks. However, it has an expensive computational cost and requires two-stream (RGB and optical flow) framework. In this paper, we propose MFNet (Motion Feature Network) containing motion blocks which make it possible to encode spatiotemporal information between adjacent frames in a unified network that can be trained end-to-end. The motion block can be attached to any existing CNN-based action recognition frameworks with only a small additional cost. We evaluated our network on two of the action recognition datasets (Jester and Something-Something) and achieved competitive performances for both datasets by training the networks from scratch.

MobileFaceNets: Efficient CNNs for Accurate Real-Time Face Verification on Mobile Devices

Chen, Liu, Gao, et al. 2018

Lecture Notes in Computer Science

544

289

We present a class of extremely efficient CNN models, MobileFaceNets, which use less than 1 million parameters and are specifically tailored for high-accuracy real-time face verification on mobile and embedded devices. We first make a simple analysis on the weakness of common mobile networks for face verification. The weakness has been well overcome by our specifically designed MobileFaceNets. Under the same experimental conditions, our MobileFaceNets achieve significantly superior accuracy as well as more than 2 times actual speedup over MobileNetV2. After being trained with ArcFace loss on the refined MS-Celeb-1M, our single MobileFaceNet of 4.0MB size achieves 99.55% accuracy on LFW and 92.59% TAR@FAR1e-6 on MegaFace, which is even comparable to state-of-the-art big CNN models of hundreds of MB in size. The fastest one of MobileFaceNets has an actual inference time of 18 milliseconds on a mobile phone. For face verification, MobileFaceNets achieve significantly improved efficiency over previous state-of-the-art mobile CNNs.

Related Work

Tuning deep neural architectures to strike an optimal balance between accuracy and performance has been an area of active research for the last several years [3]. For common visual recognition tasks, many efficient architectures have been proposed recently [1,2,3,9]. Some efficient architectures can be trained from scratch. For example, SqueezeNet [9] uses a bottleneck approach to design a very small network and achieves AlexNet-level [10] accuracy on ImageNet [11, 12] with 50x fewer parameters (i.e., 1.25 million). MobileNetV1 [1] uses depthwise separable convolutions to build lightweight deep neural networks, one of which, i.e., MobileNet-160 (0.5x), achieves 4% better accuracy on ImageNet than SqueezeNet at about the same size. ShuffleNet [2] utilizes pointwise group convolution and channel shuffle operation to reduce computation cost and achieve higher efficiency than MobileNetV1. The MobileNetV2 [3] architecture is based on an inverted residual structure with linear bottleneck and improves the state-of-the-art performance of mobile models on multiple tasks and benchmarks. The mobile NASNet [13] model, which is an architectural search result with reinforcement learning, has a much more complex structure and a much longer actual inference time on mobile devices than MobileNetV1, ShuffleNet, and MobileNetV2. However, these lightweight basic architectures are not so accurate for face verification when trained from scratch (see Table 2). Accurate lightweight architectures specifically designed for face verification have been rarely researched. [14] presents a light CNN framework to learn a compact embedding on the large-scale face data, in which the Light CNN-29 model achieves 99.33% face verification accuracy on LFW with 12.6 million parameters. Compared with MobileNetV1, Light CNN-29 is not lightweight for mobile and embedded platforms. Light CNN-4 and Ligh...
