Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019
DOI: 10.1145/3297858.3304028
Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations

Abstract: This paper describes a novel approach of packing sparse convolutional neural networks for their efficient systolic array implementations. By combining subsets of columns in the original filter matrix associated with a convolutional layer, we increase the utilization efficiency of the systolic array substantially (e.g., 4x) due to the increased density of nonzeros in the resulting packed filter matrix. In combining columns, for each row, all filter weights but one with the largest magnitude are pruned. We retra…
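The core packing step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's actual algorithm: it assumes columns are combined in fixed consecutive groups (the paper selects which column subsets to combine so as to maximize density), and the names `column_combine` and `group_size` are illustrative.

```python
import numpy as np

def column_combine(W, group_size):
    """Pack columns of a sparse filter matrix into denser groups.

    Within each group of `group_size` columns, every row keeps only its
    largest-magnitude weight and the rest are pruned, so each group
    collapses into a single dense column (one surviving weight per row).
    Assumes the number of columns is a multiple of group_size.
    """
    rows, cols = W.shape
    packed_cols = []
    for start in range(0, cols, group_size):
        group = W[:, start:start + group_size]
        # For each row, the index of the largest-magnitude weight in the group.
        keep = np.abs(group).argmax(axis=1)
        # Gather one surviving weight per row; all others are pruned.
        packed_cols.append(group[np.arange(rows), keep])
    return np.stack(packed_cols, axis=1)
```

For example, an 8-column sparse matrix packed with `group_size=4` yields a 2-column matrix, matching the roughly 4x utilization gain the abstract cites when most pruned entries were already zero.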


Cited by 125 publications (86 citation statements)
References 63 publications
“…Unlike previous work, column combining is a new pruning method which allows for sparse CNN layers, but requires that the remaining sparse weights can be packed into a denser format when deployed in hardware [27]. In our proposed training pipeline, we use column combining in addition to weight and data quantization as discussed in the previous section, in order to achieve efficient sparse CNN inference.…”
Section: Weight Pruning
confidence: 99%
“…Figure 3: A pointwise convolution layer (left) with four channels per group resulting from weight pruning training for column combining [27]. After combining columns in the filter matrix (left), each group of four channels (shown in cream and green) is reduced into a single column (right).…”
Section: Layer As Stored In Systolic Array
confidence: 99%