Laconic deep learning inference acceleration

Sharify, Sayeh; Lascorz, Alberto Delmás; Mahmoud, Magdi S.; Nikolić, Miloš; Siu, Kevin; Stuart, Dylan Malone; Poulos, Zissis; Moshovos, Andreas

doi:10.1145/3307650.3322255

Cited by 83 publications

(28 citation statements)

References 55 publications

“…[29] solves the problem of reducing the number of channels in the compression training process by combining or splitting multiple small arrays. [20], [9] Gope et al and Sharify et al [37] deploy low-precision networks and use bit-wise calculations to achieve low-power designs. However, none of these designs consider the emerging special types of convolution, including the small-scale convolution and DWConv.…”

Section: Related Work (mentioning)

confidence: 99%

Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks

Wang

et al. 2021

ACM Trans. Archit. Code Optim.

The systolic array architecture is one of the most popular choices for convolutional neural network hardware accelerators. The biggest advantage of the systolic array architecture is its simple and efficient design principle. Without complicated control and dataflow, hardware accelerators with the systolic array can calculate traditional convolution very efficiently. However, this advantage also brings new challenges to the systolic array. When computing special types of convolution, such as the small-scale convolution or depthwise convolution, the processing element (PE) utilization rate of the array decreases sharply. The main reason is that the simple architecture design limits the flexibility of the systolic array. In this article, we design a configurable multi-directional systolic array (CMSA) to address these issues. First, we added a data path to the systolic array. It allows users to split the systolic array through configuration to speed up the calculation of small-scale convolution. Second, we redesigned the PE unit so that the array has multiple data transmission modes and dataflow strategies. This allows users to switch the dataflow of the PE array to speed up the calculation of depthwise convolution. In addition, unlike other works, we only make a few changes and modifications to the existing systolic array architecture. It avoids additional hardware overheads and can be easily deployed in application scenarios that require small systolic arrays such as mobile terminals. Based on our evaluation, CMSA can increase the PE utilization rate by up to 1.6 times compared to the typical systolic array when running the last layers of ResNet-18. When running depthwise convolution in MobileNet, CMSA can increase the utilization rate by up to 14.8 times. At the same time, CMSA and the traditional systolic arrays are similar in area and energy consumption.

“…Static pruning removes weights in the offline training stage and applies the same compressed model to all samples. Unstructured pruning [9,11,30,37] targets removing individual weights with minimal contribution. The limitation of the unstructured weight pruning is that dedicated hardware and libraries [37] are needed to achieve speedup from the compression.…”

Section: Related Work (mentioning)

confidence: 99%

“…Unstructured pruning [9,11,30,37] targets removing individual weights with minimal contribution. The limitation of the unstructured weight pruning is that dedicated hardware and libraries [37] are needed to achieve speedup from the compression. Structured pruning is becoming a more practical solution where filters or blocks are ranked and pruned based on a criterion [7,13,15,27,32,34].…”

Section: Related Work (mentioning)

confidence: 99%

Fire Together Wire Together: A Dynamic Pruning Approach with Self-Supervised Mask Prediction

Elkerdawy, Elhoushi, Zhang

et al. 2021

Preprint

Dynamic model pruning is a recent direction that allows for the inference of a different sub-network for each input sample during deployment. However, current dynamic methods rely on learning a continuous channel gating through regularization by inducing sparsity loss. This formulation introduces complexity in balancing different losses (e.g., task loss, regularization loss). In addition, regularization-based methods lack transparent tradeoff hyperparameter selection to realize a computational budget. Our contribution is twofold: 1) decoupled task and pruning training, and 2) simple hyperparameter selection that enables FLOPs reduction estimation before training. Inspired by the Hebbian theory in neuroscience, "neurons that fire together wire together", we propose to predict a mask to process k filters in a layer based on the activation of its previous layer. We pose the problem as a self-supervised binary classification problem. Each mask predictor module is trained to predict if the log-likelihood for each filter in the current layer belongs to the top-k activated filters. The value k is dynamically estimated for each input based on a novel criterion using the mass of heatmaps. We show experiments on several neural architectures, such as VGG, ResNet and MobileNet, on the CIFAR and ImageNet datasets. On CIFAR, we reach similar accuracy to SOTA methods with 15% and 24% higher FLOPs reduction. Similarly, on ImageNet, we achieve a lower drop in accuracy with up to 13% improvement in FLOPs reduction.

“…The second is call-based data selection, where an organization uses a phone company's caller ID system to identify a caller. 24) "Most multiprocessor scheduling problems are NP, but for deterministic scheduling this is not a major problem. We can use a polynomial algorithm and develop an optimal schedule if the specific problem is not NP-complete, or we can use off-line heuristic search techniques based on classical theory implications.…”

Section: June 1995 (mentioning)

confidence: 99%

“…Tartan combines an inference optimization software stack run on a custom hardware platform and leverages an on-chip compression engine to squeeze out network sparsity, reduce memory traffic, and yield a 40× inference execution acceleration and an order of magnitude power reduction across a wide selection of neural networks. 23,24 The latest development for inference acceleration platforms comes as an offshoot of the optical quantum computing world. Silicon photonics uses silicon as an optical medium to guide photons, much like electronics uses conductors to guide electrical signals.…”

Section: Exciting Newcomers (mentioning)

confidence: 99%

Reliability Inversion: A Cautionary Tale

Parhami

2020

Computer

Computer solicits special issue proposals from lead experts. Proposed themes/issues should address timely, emerging topics that will be of broad interest to Computer's readership. Special issues are an important component of Computer, as they deliver essential research insights and well-developed perspectives on new and established technologies and computing strategies. We encourage submissions of high-quality proposals for the 2021 editorial calendar. Of particular interest are proposals centered on:
• offsite educational and business continuity technology challenges,
• privacy related to personal location tracking and surveillance (digital and physical),
• artificial intelligence and machine learning,
• technology's role in disrupted supply chains,
• misinformation and disinformation (fake information, malicious or non-malicious), and
• cyberwarfare/cyberterrorism.
