Findings of the Association for Computational Linguistics: NAACL 2022
DOI: 10.18653/v1/2022.findings-naacl.25

PCEE-BERT: Accelerating BERT Inference via Patient and Confident Early Exiting

Abstract: BERT and other pre-trained language models (PLMs) are ubiquitous in modern NLP. Even though PLMs are the state-of-the-art (SOTA) models for almost every NLP task (Qiu et al., 2020), the significant latency during inference prohibits wider industrial usage. In this work, we propose Patient and Confident Early Exiting BERT (PCEE-BERT), an off-the-shelf sample-dependent early exiting method that can work with different PLMs and can also work along with popular model compression methods. With a multi-exit BERT as …
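The abstract describes PCEE-BERT as a sample-dependent early-exiting method built on a multi-exit BERT. As a rough illustration of how such inference works (a minimal sketch, not the authors' implementation; `layers`, `exit_classifiers`, and `should_exit` are hypothetical names), each encoder layer feeds an internal classifier, and the forward pass stops as soon as the exit criterion fires:

```python
import torch

def early_exit_forward(layers, exit_classifiers, x, should_exit):
    """Run encoder layers one at a time; each layer feeds an internal
    classifier, and the forward pass stops as soon as `should_exit`
    fires. `should_exit` stands in for whichever criterion is used
    (entropy, patience, confidence, or a combination of these)."""
    hidden = x
    history = []  # softmax outputs of the exits seen so far
    for depth, (layer, clf) in enumerate(zip(layers, exit_classifiers), start=1):
        hidden = layer(hidden)
        logits = clf(hidden)
        history.append(torch.softmax(logits, dim=-1))
        if should_exit(history):
            return logits, depth        # early exit: remaining layers skipped
    return logits, len(layers)          # no exit fired: full-depth inference
```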

Cited by 13 publications (10 citation statements). References 13 publications.
“…In PABEE (Zhou et al 2020), the instance exits when k consecutive internal classifiers make the same prediction. PCEE-BERT (Zhang et al 2022) combined both ensemble-based and confidence-based exiting criteria. The instance exits if the confidence scores are greater than a predefined threshold for several consecutive exits.…”
Section: Related Work: Early Exiting (mentioning, confidence: 99%)
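The quoted statement is enough to sketch both criteria in a few lines (illustrative only; the threshold and patience values are assumptions, not the papers' settings). Each predicate consumes the history of per-exit softmax outputs (1-D tensors, most recent last) for a single example:

```python
def pabee_should_exit(history, patience=3):
    """PABEE-style patience, per the quote: exit when the last
    `patience` internal classifiers all predict the same class."""
    if len(history) < patience:
        return False
    preds = [int(p.argmax()) for p in history[-patience:]]
    return len(set(preds)) == 1

def pcee_should_exit(history, threshold=0.9, patience=2):
    """PCEE-BERT-style combination, per the quote: exit once the max
    softmax probability exceeds `threshold` at `patience` consecutive
    exits."""
    if len(history) < patience:
        return False
    return all(p.max().item() > threshold for p in history[-patience:])
```

Either predicate can be dropped into the `should_exit` slot of the inference sketch above.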
“…the sum of cross-entropy losses. SkipBERT (Wang et al 2022), PABEE (Zhou et al 2020), Past-Future (Liao et al 2021), PCEE-BERT (Zhang et al 2022), and LeeBERT (Zhu 2021) used a weighted sum of cross-entropy losses. DeeBERT (Xin et al 2020), Right-Tool (Schwartz et al 2020), BERxiT (Xin et al 2021), and CAT (Schuster et al 2021) used the sum of cross-entropy losses.…”
Section: Related Work: Early Exiting (mentioning, confidence: 99%)
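The two objectives mentioned in this quote differ only in how the per-exit cross-entropy losses are weighted. In the sketch below, M is the number of exits; the plain "sum of cross-entropy losses" corresponds to w_i = 1, while a depth-proportional weighting is one common "weighted sum" choice (an assumption for illustration; the exact weights differ across the cited papers):

```latex
% Multi-exit training objective over M exits. "Sum of cross-entropy
% losses" corresponds to w_i = 1; the depth-proportional w_i shown is
% one common "weighted sum" choice (assumption: the exact weights
% differ across the cited papers).
\mathcal{L}_{\text{total}}
  = \sum_{i=1}^{M} w_i \,
    \mathcal{L}_{\text{CE}}\!\left(y, \hat{y}^{(i)}\right),
\qquad
w_i = \frac{i}{\sum_{j=1}^{M} j}.
```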
“…PABEE has proposed a patience-based exit strategy that halts the forward-pass at an intermediate layer only when the pre-defined number of subsequent layers yield the same predictions. Similarly, DeeBERT and FastBERT have employed the predictive entropy to replace the patience, and PCEE-BERT (Zhang et al, 2022) has combined both patience and confidence for the exit criteria.…”
Section: Depth-wise Reduction on Transformer (mentioning, confidence: 99%)
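For contrast with the patience- and confidence-based rules sketched earlier, the entropy criterion this quote attributes to DeeBERT and FastBERT can be written the same way (the threshold value is an illustrative assumption):

```python
def entropy_should_exit(history, max_entropy=0.3):
    """Entropy-based exit, as attributed to DeeBERT/FastBERT in the
    quote above: stop when the newest exit's predictive entropy falls
    below a threshold. `history` is the list of 1-D softmax tensors
    for one example, most recent last."""
    probs = history[-1]
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
    return entropy < max_entropy
```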