Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2023
DOI: 10.18653/v1/2023.acl-long.689
Accelerating Transformer Inference for Translation via Parallel Decoding

Abstract: Autoregressive decoding limits the efficiency of transformers for Machine Translation (MT). The community proposed specific network architectures and learning-based methods to solve this issue, which are expensive and require changes to the MT model, trading inference speed at the cost of the translation quality. In this paper, we propose to address the problem from the point of view of decoding algorithms, as a less explored but rather compelling direction. We propose to reframe the standard greedy autoregres…
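The abstract is cut off above, but the direction it describes - speeding up inference by changing only the decoding algorithm, not the trained MT model - rests on viewing greedy autoregressive decoding as a system of equations that can be solved in parallel by fixed-point (Jacobi-style) iteration. The sketch below is a toy illustration of that idea only, not the paper's implementation: `next_token`, `greedy_decode`, and `jacobi_decode` are made-up stand-ins for a real Transformer decoder.

```python
# A minimal sketch of greedy decoding recast as a parallel fixed-point
# iteration. The "model" is a toy deterministic next-token function,
# not a Transformer; all names here are illustrative assumptions.

def next_token(src, prefix):
    # Toy stand-in for argmax_y p(y_i | y_<i, x).
    return (src[len(prefix) % len(src)] + sum(prefix)) % 101

def greedy_decode(src, length):
    out = []
    for _ in range(length):            # `length` strictly sequential calls
        out.append(next_token(src, out))
    return out

def jacobi_decode(src, length, max_sweeps=None):
    y = [0] * length                   # arbitrary initial guess
    for _ in range(max_sweeps or length):
        # Refresh every position in parallel from the previous iterate.
        new_y = [next_token(src, y[:i]) for i in range(length)]
        if new_y == y:                 # fixed point reached
            break
        y = new_y
    return y

src = [7, 3, 5, 11]
assert jacobi_decode(src, 8) == greedy_decode(src, 8)
```

Each parallel sweep fixes at least one additional position (position i only needs positions < i to be correct), so the iteration reproduces the greedy output in at most `length` sweeps; with a real model many positions stabilize early, which is where the speed-up comes from.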

Citations: Cited by 8 publications (3 citation statements)
References: 37 publications
“…Among the new problems, a noteworthy element is the inference efficiency: the comparison with the standard methods - which typically rely on models of limited size (100-300M parameters) - should account for this aspect, which is critical for social, economic, and environmental reasons (Strubell et al, 2019). Along this line, important research directions include i) pruning the LLM (and possibly the SFM) in a task-aware manner (Zhu et al, 2023b; Dery et al, 2024), ii) dynamic layer selection during decoding (Xin et al, 2020; Geva et al, 2022; Xia et al, 2024), and iii) efficient decoding strategies (Stern et al, 2018; Chen et al, 2023a; Leviathan et al, 2023; Santilli et al, 2023). In addition, the speech source contains a wide range of information that can be exploited depending on the paradigm used (e.g., prosody is not handled by cascade systems - Zhou et al, 2024).…”
Section: What Is Missing?
confidence: 99%
“…Distill-Spec (Zhou et al, 2023) further investigated the efficacy of knowledge distillation in enhancing the alignment between the target model and the drafter in speculative decoding. In addition to employing additional models as drafters, there has also been some research that proposes various strategies to efficiently generate drafts from the LLM itself (Santilli et al, 2023). All the following research strongly backs up the value of this original work.…”
Section: Related Work
confidence: 94%
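The draft-then-verify idea referenced in this statement can be sketched compactly. The snippet below is a minimal greedy-acceptance illustration with toy deterministic models, not Distill-Spec or any specific paper's algorithm; `draft_model`, `target_model`, and `speculative_decode` are hypothetical names.

```python
# Toy sketch of speculative decoding with greedy acceptance: a cheap
# drafter proposes a block of tokens, the target model checks them and
# keeps the longest prefix it agrees with (plus its own correction).

def target_model(prefix):
    # Stand-in for the large target model's greedy next token.
    return (3 * sum(prefix) + len(prefix)) % 50

def draft_model(prefix):
    # Cheap approximation of the target: agrees with it most of the time.
    guess = target_model(prefix)
    return guess if guess % 7 else (guess + 1) % 50

def speculative_decode(length, k=4):
    out = []
    while len(out) < length:
        # 1) Drafter proposes k tokens autoregressively (cheap calls).
        draft = []
        for _ in range(k):
            draft.append(draft_model(out + draft))
        # 2) In a real Transformer the target scores all k positions in a
        #    single forward pass; here we just loop to accept or correct.
        accepted = []
        for tok in draft:
            expected = target_model(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)   # target's correction, stop here
                break
        out.extend(accepted)
    return out[:length]

def greedy_target(length):
    out = []
    for _ in range(length):
        out.append(target_model(out))
    return out

# Same output as plain greedy decoding with the target model alone,
# but with fewer sequential target calls whenever drafts are accepted.
assert speculative_decode(12) == greedy_target(12)
```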
“…The non-autoregressive decoding, which generates multiple output tokens in parallel, was initially proposed by Gu et al (2018). Several works (Ghazvininejad et al, 2019; Gu and Kong, 2021; Savinov et al, 2022; Santilli et al, 2023) have since focused on enhancing generation quality in machine translation tasks. Subsequently, Leviathan et al (2023) introduced speculative decoding for sequence generation tasks.…”
Section: Parallel Decoding
confidence: 99%
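As a contrast with the draft-based methods above, the non-autoregressive decoding described in this statement predicts every target position independently, so a whole sequence comes out of one parallel step. The toy sketch below only illustrates that independence; `nat_token` is an invented position-wise predictor, and real non-autoregressive systems add length prediction and iterative refinement (e.g., Ghazvininejad et al, 2019) to recover quality.

```python
# Toy illustration of non-autoregressive decoding: no position waits for
# any other, which gives the speed-up and also the quality gap that
# refinement-based follow-ups try to close.

def nat_token(src, position):
    # Invented position-wise predictor conditioned only on the source.
    return (src[position % len(src)] * (position + 1)) % 101

def non_autoregressive_decode(src, length):
    # Conceptually a single forward pass over all positions at once.
    return [nat_token(src, i) for i in range(length)]

print(non_autoregressive_decode([7, 3, 5, 11], 6))
```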