“…Quantization (Low Precision Inference): A common solution is to compress NN models with quantization (Asanovic and Morgan, 1991; Hubara et al, 2016; Rastegari et al, 2016; Zhou et al, 2016, 2017; Cai et al, 2017, 2020b; Choi et al, 2018; Jacob et al, 2018; Zhang et al, 2018a; Dong et al, 2019; Wang et al, 2019c; Chin et al, 2020; Gholami et al, 2021), in which weights and activations are stored and computed at low bit-precision. A notable work here is Deep Compression (Han et al, 2016), which combines pruning, quantization, and Huffman coding; applied to the SqueezeNet model discussed above, it reduced the model footprint to 510x smaller than AlexNet.…”
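To make the core idea concrete, below is a minimal sketch of symmetric uniform 8-bit quantization, the simplest form of the technique: float32 weights are replaced by int8 values plus a single scale factor, cutting storage by 4x. This is an illustrative example, not the method of any specific paper cited above; the function names are hypothetical.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric uniform quantization of a weight tensor to int8."""
    # Map the largest weight magnitude to the edge of the int8 range.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from the int8 values."""
    return q.astype(np.float32) * scale

# Storing int8 values plus one scale instead of float32 weights
# shrinks the footprint by 4x; lower bit-widths compress further.
w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
print(q.nbytes / w.nbytes)          # 0.25, i.e. 4x smaller
print(np.abs(dequantize(q, s) - w).max())  # small rounding error
```

Methods like Deep Compression go well beyond this sketch, combining quantization with pruning and entropy coding, and the cited works explore lower bit-widths, non-uniform quantizers, and training-time (quantization-aware) approaches.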