2021
DOI: 10.48550/arxiv.2111.13824
Preprint

FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer

Abstract: Network quantization significantly reduces model inference complexity and has been widely used in real-world deployments. However, most existing quantization methods have been developed and tested mainly on Convolutional Neural Networks (CNNs), and suffer severe degradation when applied to Transformer-based architectures. In this work, we present a systematic method to reduce the performance degradation and inference complexity of Quantized Transformers. In particular, we propose Powers-of-Two Scale (PTS) to de…

Cited by 2 publications (7 citation statements) | References 28 publications

“…FQ-ViT [43] adopts log2 quantization (see Eq. 9) and assigns more bins to the frequently occurring small values found in post-softmax activations (attention maps), in contrast to 4-bit uniform quantization, which allocates only one bin for these values.…”
Section: A. Activation Quantization Optimization
Mentioning, confidence: 99%
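
As a rough illustration of the contrast described above, the sketch below compares 4-bit log2 quantization of post-softmax attention values against a plain 4-bit uniform quantizer. The function names, clipping choices, and example values are illustrative assumptions, not FQ-ViT's exact implementation.

```python
import numpy as np

def log2_quantize(attn, n_bits=4):
    """Log2 quantization of post-softmax attention values in [0, 1].

    Values map to q = clip(round(-log2(x)), 0, 2**n_bits - 1) and are
    dequantized as 2**(-q), so the frequent near-zero attention weights
    keep distinct quantization levels.
    """
    qmax = 2 ** n_bits - 1
    # Guard against log2(0); anything below 2**(-qmax) saturates to the last bin.
    q = np.clip(np.round(-np.log2(np.maximum(attn, 2.0 ** -qmax))), 0, qmax)
    return 2.0 ** -q

def uniform_quantize(attn, n_bits=4):
    """Baseline uniform quantization with scale 1 / (2**n_bits - 1)."""
    qmax = 2 ** n_bits - 1
    scale = 1.0 / qmax
    return np.clip(np.round(attn / scale), 0, qmax) * scale

attn = np.array([0.6, 0.05, 0.01, 0.002])   # typical post-softmax magnitudes
print(log2_quantize(attn))     # [0.5, 0.0625, 0.0078125, 0.001953125]
print(uniform_quantize(attn))  # [0.6, 0.0667, 0.0, 0.0] -- small values collapse
```

With the uniform quantizer every value below 1/30 falls into the zero bin, while the log2 quantizer still separates these small attention weights, which is the point the citing statement makes.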
“…2) Post-LayerNorm Activation: FQ-ViT [43] proposes the Power-of-Two Factor (PTF) for Pre-LayerNorm quantization. The core idea of PTF is to assign different factors to different channels instead of different quantization parameters.…”
Section: A. Activation Quantization Optimization
Mentioning, confidence: 99%
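
The quoted idea of per-channel power-of-two factors over a single layer-wise scale can be sketched as follows; the helper name ptf_quantize, the symmetric signed range, and the way each channel's factor is chosen are assumptions made for illustration, not the paper's exact procedure.

```python
import numpy as np

def ptf_quantize(x, n_bits=8, K=3):
    """Sketch of per-channel power-of-two factors over one shared layer scale.

    All channels share a single base scale `s`; channel c only chooses an
    integer alpha_c in {0, ..., K}, so its effective scale is s * 2**alpha_c,
    which a runtime can realize as a bit-shift rather than a per-channel
    multiply. `x` has shape (tokens, channels), e.g. a LayerNorm input.
    """
    qmax = 2 ** (n_bits - 1) - 1                       # symmetric signed range
    ch_max = np.abs(x).max(axis=0) + 1e-12             # per-channel range
    s = ch_max.max() / (qmax * 2 ** K)                 # shared base scale
    # Smallest alpha_c such that s * 2**alpha_c * qmax covers the channel range.
    alpha = np.clip(np.ceil(np.log2(ch_max / (s * qmax))), 0, K).astype(int)
    scale_c = s * 2.0 ** alpha                         # effective per-channel scale
    q = np.clip(np.round(x / scale_c), -qmax - 1, qmax)
    return q * scale_c, alpha                          # dequantized tensor, factors

# Channels with strongly different ranges, as is typical for LayerNorm inputs.
x = np.random.randn(197, 768) * np.logspace(-1.0, 1.0, 768)
x_hat, alpha = ptf_quantize(x)
print(alpha.min(), alpha.max())   # small-range channels receive small factors
```

Keeping one shared base scale plus integer shift factors is what lets the activation be handled with layer-wise quantization parameters while still absorbing the inter-channel variation the statement refers to.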