2021
DOI: 10.48550/arxiv.2112.08560
Preprint

Block-Skim: Efficient Question Answering for Transformer

Abstract: Transformer models have achieved promising results on natural language processing (NLP) tasks including extractive question answering (QA). Common Transformer encoders used in NLP tasks process the hidden states of all input tokens in the context paragraph throughout all layers. However, unlike other tasks such as sequence classification, answering the raised question does not necessarily require all the tokens in the context paragraph. Following this motivation, we propose Block-Skim, which learns to skim…
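As a rough illustration of the idea in the abstract, the sketch below drops whole context blocks at intermediate layers so that later layers only process the tokens that survive. This is a minimal, hypothetical Python/NumPy sketch, not the paper's implementation; skim_forward, relevance_fn, the block layout, and the threshold are names and assumptions chosen here for illustration.

    import numpy as np

    def skim_forward(hidden, block_ids, layers, relevance_fn, threshold=0.5):
        """Toy encoder pass that skims (drops) context blocks mid-network.

        hidden       : (seq_len, dim) token hidden states
        block_ids    : (seq_len,) block index of each token
        layers       : list of callables mapping (n, dim) -> (n, dim)
        relevance_fn : hypothetical scorer returning a block relevance score

        In a real system the question tokens would be exempt from skimming;
        this toy version treats every block as skimmable.
        """
        keep = np.ones(hidden.shape[0], dtype=bool)
        for layer in layers:
            hidden[keep] = layer(hidden[keep])      # only surviving tokens are processed
            for b in np.unique(block_ids[keep]):
                mask = keep & (block_ids == b)
                if relevance_fn(hidden[mask]) < threshold:
                    keep &= ~mask                   # later layers never see this block
        return hidden, keep

    # Dummy usage: 4 blocks of 8 tokens, 4 random layers, a placeholder scorer.
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((16, 16)) / 4 for _ in range(4)]
    layers = [lambda x, W=W: np.tanh(x @ W) for W in weights]
    hidden = rng.standard_normal((32, 16))
    block_ids = np.repeat(np.arange(4), 8)
    out, kept = skim_forward(hidden, block_ids, layers,
                             relevance_fn=lambda h: float(np.abs(h).mean()),
                             threshold=0.3)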


Cited by 3 publications (3 citation statements) | References 25 publications
“…The context branch is preprocessed off-line and pruned at shallow layers. Also dedicated to QA tasks, Block-Skim (Guan et al., 2021) proposes to predict and skim the irrelevant context blocks by analyzing the attention weight patterns. Progressive Growth (Gu et al., 2021) randomly drops a portion of input tokens during training to achieve better pre-training efficiency.…”
Section: Related Work
mentioning, confidence: 99%
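The statement above attributes the skim decision to attention weight patterns. One hypothetical way to turn a layer's attention weights into per-block scores is sketched below; the pooling choice and the focus on attention paid by question/[CLS] tokens are assumptions made for illustration, not the cited paper's learned predictor.

    import numpy as np

    def block_relevance_from_attention(attn, block_ids, query_idx):
        """Average attention that question (or [CLS]) tokens pay to each block.

        attn      : (heads, seq_len, seq_len) attention weights of one layer
        block_ids : (seq_len,) block index per token
        query_idx : indices of the question/[CLS] tokens doing the looking
        """
        from_query = attn[:, query_idx, :].mean(axis=0)      # (|query|, seq_len)
        return {int(b): float(from_query[:, block_ids == b].mean())
                for b in np.unique(block_ids)}               # low score => skim candidate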
“…However, pruning will cause sparse irregular memory accesses. Therefore, pruning needs software (Gale et al., 2020; Guan et al., 2020; Qiu et al., 2019; Guo et al., 2020a; Guan et al., 2021; Fedus et al., 2021) and hardware (Gondimalla et al., 2019; Guo et al., 2020b; Zhang et al., 2020) optimization to accelerate.…”
Section: Related Work
mentioning, confidence: 99%
“…The principles of attaching exits vary. The mechanisms in [25, 37, 49, 54] directly place exits after each block of the transformer model, assuming the overhead of exits is small. In [12, 22, 29, 30, 43], the placement of exits is hand-crafted and depends on the model architecture.…”
Section: Multi-exit DNN Models
mentioning, confidence: 99%
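For the exit-after-every-block scheme mentioned in the last statement, a minimal sketch follows; the function names, shapes, and confidence threshold are illustrative and not taken from any of the cited systems.

    import numpy as np

    def softmax(logits):
        e = np.exp(logits - logits.max())
        return e / e.sum()

    def early_exit_forward(x, blocks, exit_heads, threshold=0.9):
        """Run blocks in order; stop at the first exit head that is confident.

        blocks     : list of callables mapping a feature vector to a feature vector
        exit_heads : one classifier head per block, mapping features to class logits
        """
        for depth, (block, head) in enumerate(zip(blocks, exit_heads)):
            x = block(x)
            probs = softmax(head(x))
            if probs.max() >= threshold:             # confident enough: exit early
                return int(probs.argmax()), depth
        return int(probs.argmax()), len(blocks) - 1  # fell through to the final exit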