2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
DOI: 10.1109/micro50266.2020.00032
Non-Blocking Simultaneous Multithreading: Embracing the Resiliency of Deep Neural Networks

Cited by 11 publications (11 citation statements)
References 42 publications
“…SySMT [28] leverages sparsity in quantization of both activations and weights to 4 bits. Their method incurs relatively high area overheads, since the quantization logic has to be scaled with the number of processing units.…”
Section: Related Work
confidence: 99%
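The statement above refers to on-the-fly 4-bit quantization of activations and weights. As a rough illustration only, the Python sketch below shows uniform symmetric 4-bit quantization; the function name, scale handling, and rounding mode are assumptions for the example, not SySMT's actual hardware logic.

```python
import numpy as np

def quantize_4bit(x, scale):
    """Hypothetical helper (not from the cited paper): uniform symmetric
    quantization to a signed 4-bit grid. In hardware, logic of this kind
    would be replicated next to every processing unit, which is where the
    area overhead mentioned above comes from."""
    q = np.clip(np.round(np.asarray(x) / scale), -8, 7)  # signed 4-bit range [-8, 7]
    return q * scale                                      # dequantized value fed to the MAC
```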
“…Also, consider a single MAC unit that computes a single activation-weight multiplication per cycle. vSPARQ, similar to [28,30], groups activations in pairs to leverage the dynamic and unstructured activation sparsity. That is, the DP calculations can be formulated as:…”
Section: vSPARQ: Leveraging Sparsity with Pairs of Activations
confidence: 99%
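The formula referred to above is truncated in the excerpt and is not reproduced here. Instead, the following Python sketch illustrates the general idea of pairing activations, under the assumption that a zero activation lets its partner keep full precision while two non-zero activations are both trimmed to 4 bits; the function names and the quantization step are hypothetical.

```python
import numpy as np

def q4(v, scale=1.0):
    # Assumed 4-bit trim applied when both activations in a pair are non-zero.
    return np.clip(np.round(v / scale), -8, 7) * scale

def paired_dot_product(acts, weights):
    """Hypothetical pair-wise dot product: exploits dynamic, unstructured
    activation sparsity by letting a zero activation donate its precision
    budget to its partner."""
    assert len(acts) == len(weights) and len(acts) % 2 == 0
    total = 0.0
    for i in range(0, len(acts), 2):
        a0, a1 = acts[i], acts[i + 1]
        w0, w1 = weights[i], weights[i + 1]
        if a1 == 0:
            total += a0 * w0                    # partner keeps full precision
        elif a0 == 0:
            total += a1 * w1
        else:
            total += q4(a0) * w0 + q4(a1) * w1  # both reduced to 4 bits
    return total
```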
“…Inspired by conventional SMT, Shomron and Weiser [18] propose non-blocking SMT (NB-SMT) designated for deep neural networks (DNNs). NB-SMT mitigates hardware underutilization of DNNs caused by unstructured sparsity.…”
Section: Introduction
confidence: 99%
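As a back-of-the-envelope illustration of why "non-blocking" matters, the toy model below counts cycles for one MAC unit shared by two operand "threads": a blocking design serializes when both operands are non-zero, while a non-blocking policy squeezes both in at reduced precision and never stalls. The cycle costs and the roughly 50% sparsity are assumptions for the example, not the paper's microarchitecture.

```python
import numpy as np

def mac_cycles(act_pairs, non_blocking=True):
    """Toy cycle count for one MAC shared by two operand streams (assumed model)."""
    cycles = 0
    for a0, a1 in act_pairs:
        both_busy = (a0 != 0) and (a1 != 0)
        if both_busy and not non_blocking:
            cycles += 2   # blocking design: serialize the two threads
        else:
            cycles += 1   # zero operand, or reduced-precision co-execution
    return cycles

rng = np.random.default_rng(0)
acts = rng.integers(-8, 8, size=(1000, 2)) * (rng.random((1000, 2)) > 0.5)  # ~50% zeros
print("blocking:", mac_cycles(acts, False), "cycles; non-blocking:", mac_cycles(acts, True))
```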
“…The importance of recalibrating the BatchNorm layers due to an external corrupted dataset or internal perturbations in activations and/or weights has been discussed in other works as well. Schneider et al [16] show how BatchNorm recalibration can improve the robustness of vision models to image corruptions (e.g., blurring and compression artifacts); Tsai et al [25] propose to recalibrate the BatchNorm layers in the scenario of noise in analog accelerators; Shomron et al [20] recalibrate the BatchNorm layers to redeem some of the accuracy degradation due to zero-valued activation mispredictions; and Hubara et al [9], as well as Shomron and Weiser [18], suggest post-quantization BatchNorm recalibration. Other works [4,10,21,22] tackle the problem of mitigating BatchNorm training and inference discrepancy due to relatively small batch sizes.…”
Section: Introduction
confidence: 99%
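The PyTorch sketch below shows the common form of BatchNorm recalibration discussed above: reset the running statistics and re-estimate them by forwarding a small calibration set in training mode, without touching the weights. The function name and loader interface are assumptions; the cited works each define their own procedures.

```python
import torch
import torch.nn as nn

def recalibrate_batchnorm(model, calib_loader, device="cpu"):
    """Minimal sketch: re-estimate BatchNorm running mean/var from calibration data."""
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()   # clear running mean/var and the batch counter
            m.momentum = None         # use a cumulative moving average instead
    model.train()                     # BN layers update their buffers in train mode
    with torch.no_grad():             # no gradients, no weight updates
        for x, _ in calib_loader:
            model(x.to(device))
    return model.eval()
```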