2019
DOI: 10.48550/arxiv.1910.04540
Preprint

QPyTorch: A Low-Precision Arithmetic Simulation Framework

Cited by 3 publications (3 citation statements)
References 4 publications
“…A. Platform, Datasets, and Models: PyTorch and QPyTorch [10] were used as the frameworks to study the proposed method. Four commonly used NLP datasets, UDPOS [11], SNLI [12], Multi30K [13], and WikiText-2 [14], were used in the simulations.…”
Section: Simulation and Discussion
confidence: 99%
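
As background on what using QPyTorch as a simulation framework looks like in practice, here is a minimal sketch (not taken from the cited paper; the tensor and the 8-bit format are hypothetical) of quantizing a tensor with qtorch.quant.float_quantize:

```python
import torch
from qtorch.quant import float_quantize

# Simulate storing a tensor in a hypothetical 8-bit float format
# (5 exponent bits, 2 mantissa bits), rounding to nearest.
x = torch.randn(4, 4)
x_low = float_quantize(x, exp=5, man=2, rounding="nearest")
print(x_low)  # same shape as x, values snapped to the low-precision grid
```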
“…We slightly modified the self-attention layers in Longformer and ViL by inserting quantization layers after the operators, to simulate the precision on SALO. These quantization layers are implemented by QPyTorch [17], a low-precision arithmetic simulation package. We perform quantization-aware finetuning on both pretrained models.…”
Section: Impact Of Quantization
confidence: 99%
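
A rough sketch of that pattern follows (not the SALO authors' code; the attention wrapper and the 8-bit format are hypothetical). QPyTorch's Quantizer is a regular nn.Module, so it can be inserted after any operator to simulate the operator's output being stored at reduced precision:

```python
import torch
import torch.nn as nn
from qtorch import FloatingPoint
from qtorch.quant import Quantizer

class QuantizedAttnScores(nn.Module):
    """Computes attention scores, then quantizes the result to a low-precision
    float format to simulate reduced-precision hardware (format is hypothetical)."""
    def __init__(self, exp=5, man=2):
        super().__init__()
        num = FloatingPoint(exp=exp, man=man)
        # Quantizer is an nn.Module, so it can be dropped in after any operator.
        self.quant = Quantizer(forward_number=num, backward_number=num,
                               forward_rounding="nearest",
                               backward_rounding="stochastic")

    def forward(self, q, k):
        scores = torch.matmul(q, k.transpose(-2, -1))  # full-precision matmul
        return self.quant(scores)                      # low-precision simulation
```

Because the quantization layer participates in autograd, the same wrapper supports the quantization-aware finetuning the statement describes.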
“…To extensively evaluate pure 16-bit training with stochastic rounding and Kahan summation, we additionally consider larger datasets and more applications: ResNet-50 on ImageNet [34], BERT-Base on the Wiki103 language model [35], the DLRM model on the Criteo Terabyte dataset [36], and Deepspeech2 [20] on the LibriSpeech datasets [37]. As there is no publicly available accelerator with the software and hardware support necessary for our study, we simulate pure 16-bit training using the QPyTorch simulator [38]. QPyTorch models PyTorch kernels such as matrix multiplication as compute graph operators, and effectively simulates FMAC units with 32-bit accumulators.…”
Section: Experiments In Deep Learning
confidence: 99%
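
As a sketch of how such a simulation can be set up (this is not the Revisiting BFloat16 Training code; the model, learning rate, and choice of 16-bit format are hypothetical), QPyTorch's OptimLP wrapper re-quantizes weights and gradients on every update with stochastic rounding. The Kahan-summation accumulator from the cited study would require a custom optimizer step and is not shown:

```python
import torch
from qtorch import FloatingPoint
from qtorch.optim import OptimLP
from qtorch.quant import quantizer

# 16-bit float format (IEEE half: 5 exponent bits, 10 mantissa bits).
half = FloatingPoint(exp=5, man=10)

# Quantization functions with stochastic rounding for weights and gradients.
weight_quant = quantizer(forward_number=half, forward_rounding="stochastic")
grad_quant = quantizer(forward_number=half, forward_rounding="stochastic")

model = torch.nn.Linear(128, 10)  # stand-in for a real network
base_opt = torch.optim.SGD(model.parameters(), lr=0.1)

# OptimLP wraps the base optimizer so parameters and gradients are
# re-quantized to 16 bits on every step, simulating pure 16-bit training.
opt = OptimLP(base_opt, weight_quant=weight_quant, grad_quant=grad_quant)
```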

Revisiting BFloat16 Training

Zamirai, Zhang, Aberger et al. 2020. Preprint.