Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2023
DOI: 10.18653/v1/2023.acl-long.689
Accelerating Transformer Inference for Translation via Parallel Decoding

Abstract: Autoregressive decoding limits the efficiency of transformers for Machine Translation (MT). The community proposed specific network architectures and learning-based methods to solve this issue, which are expensive and require changes to the MT model, trading inference speed at the cost of the translation quality. In this paper, we propose to address the problem from the point of view of decoding algorithms, as a less explored but rather compelling direction. We propose to reframe the standard greedy autoregres…
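The abstract is cut off above, but the direction it describes - speeding up inference by changing only the decoding algorithm, not the trained MT model - rests on viewing greedy autoregressive decoding as a system of equations that can be solved in parallel by fixed-point (Jacobi-style) iteration. The sketch below is a toy illustration of that idea only, not the paper's implementation: `next_token`, `greedy_decode`, and `jacobi_decode` are made-up stand-ins for a real Transformer decoder.

```python
# A minimal sketch of greedy decoding recast as a parallel fixed-point
# iteration. The "model" is a toy deterministic next-token function,
# not a Transformer; all names here are illustrative assumptions.

def next_token(src, prefix):
    # Toy stand-in for argmax_y p(y_i | y_<i, x).
    return (src[len(prefix) % len(src)] + sum(prefix)) % 101

def greedy_decode(src, length):
    out = []
    for _ in range(length):            # `length` strictly sequential calls
        out.append(next_token(src, out))
    return out

def jacobi_decode(src, length, max_sweeps=None):
    y = [0] * length                   # arbitrary initial guess
    for _ in range(max_sweeps or length):
        # Refresh every position in parallel from the previous iterate.
        new_y = [next_token(src, y[:i]) for i in range(length)]
        if new_y == y:                 # fixed point reached
            break
        y = new_y
    return y

src = [7, 3, 5, 11]
assert jacobi_decode(src, 8) == greedy_decode(src, 8)
```

Each parallel sweep fixes at least one additional position (position i only needs positions < i to be correct), so the iteration reproduces the greedy output in at most `length` sweeps; with a real model many positions stabilize early, which is where the speed-up comes from.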

Citations: Cited by 8 publications (3 citation statements)
References: 37 publications
“…Among the new problems, a noteworthy element is the inference efficiency: the comparison with the standard methods - which typically rely on models of limited size (100-300M parameters) - should account for this aspect, which is critical for social, economic, and environmental reasons (Strubell et al, 2019). Along this line, important research directions include i) pruning the LLM (and possibly the SFM) in a task-aware manner (Zhu et al, 2023b; Dery et al, 2024), ii) dynamic layer selection during decoding (Xin et al, 2020; Geva et al, 2022; Xia et al, 2024), and iii) efficient decoding strategies (Stern et al, 2018; Chen et al, 2023a; Leviathan et al, 2023; Santilli et al, 2023). In addition, the speech source contains a wide range of information that can be exploited depending on the paradigm used (e.g., prosody is not handled by cascade systems - Zhou et al, 2024).…”
Section: What Is Missing?
confidence: 99%
“…Distill-Spec (Zhou et al, 2023) further investigated the efficacy of knowledge distillation in enhancing the alignment between the target model and the drafter in speculative decoding. In addition to employing additional models as drafters, there has also been some research that proposes various strategies to efficiently generate drafts from the LLM itself (Santilli et al, 2023). All the following research strongly backs up the value of this original work.…”
Section: Related Work
confidence: 94%
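The draft-then-verify idea referenced in this statement can be sketched compactly. The snippet below is a minimal greedy-acceptance illustration with toy deterministic models, not Distill-Spec or any specific paper's algorithm; `draft_model`, `target_model`, and `speculative_decode` are hypothetical names.

```python
# Toy sketch of speculative decoding with greedy acceptance: a cheap
# drafter proposes a block of tokens, the target model checks them and
# keeps the longest prefix it agrees with (plus its own correction).

def target_model(prefix):
    # Stand-in for the large target model's greedy next token.
    return (3 * sum(prefix) + len(prefix)) % 50

def draft_model(prefix):
    # Cheap approximation of the target: agrees with it most of the time.
    guess = target_model(prefix)
    return guess if guess % 7 else (guess + 1) % 50

def speculative_decode(length, k=4):
    out = []
    while len(out) < length:
        # 1) Drafter proposes k tokens autoregressively (cheap calls).
        draft = []
        for _ in range(k):
            draft.append(draft_model(out + draft))
        # 2) In a real Transformer the target scores all k positions in a
        #    single forward pass; here we just loop to accept or correct.
        accepted = []
        for tok in draft:
            expected = target_model(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)   # target's correction, stop here
                break
        out.extend(accepted)
    return out[:length]

def greedy_target(length):
    out = []
    for _ in range(length):
        out.append(target_model(out))
    return out

# Same output as plain greedy decoding with the target model alone,
# but with fewer sequential target calls whenever drafts are accepted.
assert speculative_decode(12) == greedy_target(12)
```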
“…The non-autoregressive decoding, which generates multiple output tokens in parallel, was initially proposed by Gu et al (2018). Several works (Ghazvininejad et al, 2019; Gu and Kong, 2021; Savinov et al, 2022; Santilli et al, 2023) have since focused on enhancing generation quality in machine translation tasks. Subsequently, Leviathan et al (2023) introduced speculative decoding for sequence generation tasks.…”
Section: Parallel Decoding
confidence: 99%
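As a contrast with the draft-based methods above, the non-autoregressive decoding described in this statement predicts every target position independently, so a whole sequence comes out of one parallel step. The toy sketch below only illustrates that independence; `nat_token` is an invented position-wise predictor, and real non-autoregressive systems add length prediction and iterative refinement (e.g., Ghazvininejad et al, 2019) to recover quality.

```python
# Toy illustration of non-autoregressive decoding: no position waits for
# any other, which gives the speed-up and also the quality gap that
# refinement-based follow-ups try to close.

def nat_token(src, position):
    # Invented position-wise predictor conditioned only on the source.
    return (src[position % len(src)] * (position + 1)) % 101

def non_autoregressive_decode(src, length):
    # Conceptually a single forward pass over all positions at once.
    return [nat_token(src, i) for i in range(length)]

print(non_autoregressive_decode([7, 3, 5, 11], 6))
```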