Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing 2021
DOI: 10.18653/v1/2021.sustainlp-1.14

Shrinking Bigfoot: Reducing wav2vec 2.0 footprint

Abstract: Wav2vec 2.0 is a state-of-the-art speech recognition model which maps speech audio waveforms into latent representations. The largest version of wav2vec 2.0 contains 317 million parameters. Hence, the inference latency of wav2vec 2.0 will be a bottleneck in production, leading to high costs and a significant environmental footprint. To improve wav2vec's applicability to a production setting, we explore multiple model compression methods borrowed from the domain of large language models. Using a teacher-student…
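
To put the 317-million-parameter figure from the abstract in context, the short sketch below (not part of the paper) loads the large wav2vec 2.0 checkpoint via the Hugging Face transformers library and counts its parameters; the specific checkpoint name and the use of this library are assumptions made only for illustration.

```python
# Minimal sketch (not from the paper): estimate the parameter count of a
# large wav2vec 2.0 checkpoint. Assumes the Hugging Face `transformers`
# library and the public "facebook/wav2vec2-large-960h" checkpoint.
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-960h")
n_params = sum(p.numel() for p in model.parameters())
# Expected to be on the order of the 317M figure cited in the abstract.
print(f"wav2vec 2.0 large: {n_params / 1e6:.0f}M parameters")
```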

Cited by 13 publications (2 citation statements)
References 16 publications

“…Reducing architectural complexity of large-scale pretrained models has become an indispensable research endeavor [15,8,16,17,18,19,20,21]. DistilHuBERT [8] is proposed to distill hidden representations from HuBERT BASE.…”
Section: Speech Pre-trained Model Compression
Confidence: 99%
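
The statement above points to DistilHuBERT-style compression, in which a student is trained to reproduce the teacher's hidden representations. As a minimal, hedged sketch of feature-level distillation (not necessarily the exact loss used in DistilHuBERT), a student's hidden states can be pulled toward the teacher's with an L1 term plus a cosine-alignment term:

```python
# Hedged sketch of feature-level distillation: match student hidden states
# to frozen teacher hidden states. Loss form and weighting are assumptions.
import torch
import torch.nn.functional as F

def hidden_distill_loss(student_hidden, teacher_hidden, lam=1.0):
    """student_hidden, teacher_hidden: (batch, frames, dim) feature tensors."""
    # Penalize elementwise distance between student and teacher features.
    l1 = F.l1_loss(student_hidden, teacher_hidden)
    # Encourage directional alignment of the per-frame feature vectors.
    cos = F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()
    return l1 - lam * torch.log(torch.sigmoid(cos))
```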

“…One common approach is knowledge distillation [11], which trains a small student model with a pre-specified architecture to match the soft targets generated by a large pre-trained model. Distillation has shown to be effective in natural language processing (NLP) [12,13] and speech processing [14,15,16,17], but it usually performs general distillation using large amounts of unlabeled data before task-specific distillation or fine-tuning. This can make the training procedure computationally expensive.…”
Section: Introduction
Confidence: 99%
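
The citation above summarizes knowledge distillation as training a small student to match the soft targets of a large pre-trained teacher. The PyTorch sketch below illustrates that soft-target matching with a temperature-scaled KL divergence; the function, temperature, and placeholder model names are illustrative assumptions, not the procedure used in the cited works.

```python
# Minimal knowledge-distillation sketch (assumed recipe): the student is
# trained to match the temperature-softened output distribution of a
# frozen teacher, as described in the citation statement above.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Usage with hypothetical `teacher` and `student` models returning
# frame-level logits for a batch of waveform features `x`:
#     with torch.no_grad():
#         t_logits = teacher(x)
#     loss = distillation_loss(student(x), t_logits)
#     loss.backward()
```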