Findings of the Association for Computational Linguistics: EMNLP 2022
DOI: 10.18653/v1/2022.findings-emnlp.142

RedApt: An Adaptor for wav2vec 2 Encoding — Faster and Smaller Speech Translation without Quality Compromise

Abstract: Pre-trained speech Transformers in speech translation (ST) have facilitated state-of-the-art (SotA) results; yet, using such encoders is computationally expensive. To improve this, we present a novel Reducer Adaptor block, RedApt, that can be seamlessly integrated within any Transformer-based speech encoding architecture. Integrating RedApt with the pre-trained WAV2VEC 2 speech encoder brings a 41% speedup and a 33% memory reduction, with 24% fewer FLOPs at inference. To our positive surprise, our ST model with RedApt…
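The abstract does not spell out RedApt's internals. Below is a minimal sketch of a length-reducing adaptor block of the kind it describes, assuming a strided 1-D convolution placed between Transformer layers; the class name ReducerAdaptor, the hidden size of 768 (wav2vec 2 base), and the stride of 2 are illustrative assumptions, not the paper's actual configuration.

```python
# A minimal sketch (not the paper's RedApt design): an adaptor that halves
# the frame sequence so every downstream Transformer layer runs on fewer
# tokens, which is how such blocks reduce FLOPs, memory, and latency.
import torch
import torch.nn as nn


class ReducerAdaptor(nn.Module):
    """Strided-convolution adaptor that subsamples hidden states in time."""

    def __init__(self, dim: int = 768, stride: int = 2):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Strided 1-D convolution: (batch, dim, time) -> (batch, dim, time/stride)
        self.reduce = nn.Conv1d(dim, dim, kernel_size=stride * 2,
                                stride=stride, padding=stride // 2)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) hidden states from a speech encoder layer
        y = self.norm(x).transpose(1, 2)   # -> (batch, dim, time)
        y = self.act(self.reduce(y))       # -> (batch, dim, time/stride)
        return y.transpose(1, 2)           # -> (batch, time/stride, dim)


if __name__ == "__main__":
    block = ReducerAdaptor(dim=768, stride=2)
    hidden = torch.randn(1, 200, 768)      # 200 speech frames
    print(block(hidden).shape)             # torch.Size([1, 100, 768])
```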

Cited by: 1 publication (1 citation statement)
References: 15 publications
“…Such distance metrics include the Euclidean (Tang et al., 2021a), cosine (Chuang et al., 2020), and contrastive (Ouyang et al., 2023), but they typically require some transformation in the representations, such as mean-pooling, while our approach optimizes a distance that does not alter the representation space. Methods to reduce the length discrepancy usually include sub-sampling the speech representation using convolutional length adaptors (Gállego et al., 2021; Fang et al., 2022; Zhao et al., 2022) or character/phoneme-based CTC compression (Xu et al., 2021a). Several methods have also used phonemized text in order to better match the representations in both length and content (Tang et al., 2021a, 2022; Le et al., 2023), but potentially limiting the quality of the text branch due to noise.…”
Section: Introduction (confidence: 99%)
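As a toy illustration of the mean-pool-then-compare transformation this statement refers to, the snippet below collapses variable-length speech and text representations into single vectors before taking a cosine distance. All shapes and names (mean_pool, the 512-dimensional embeddings) are hypothetical, not taken from any cited implementation.

```python
# Toy illustration: mean-pooling variable-length representations, then
# measuring a cosine distance between the pooled speech and text vectors.
import torch
import torch.nn.functional as F


def mean_pool(h: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # h: (batch, time, dim); mask: (batch, time) with 1 marking valid frames
    mask = mask.unsqueeze(-1).float()
    return (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)


speech = torch.randn(2, 120, 512)   # e.g. 120 speech-encoder frames
text = torch.randn(2, 20, 512)      # e.g. 20 subword embeddings
s_mask = torch.ones(2, 120)
t_mask = torch.ones(2, 20)

# Cosine distance between the pooled representations, one per pair.
dist = 1.0 - F.cosine_similarity(mean_pool(speech, s_mask),
                                 mean_pool(text, t_mask), dim=-1)
print(dist)
```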