2021
DOI: 10.48550/arxiv.2111.05948
Preprint
Scaling ASR Improves Zero and Few Shot Learning

Abstract: With 4.5 million hours of English speech from 10 different sources across 120 countries and models of up to 10 billion parameters, we explore the frontiers of scale for automatic speech recognition. We propose data selection techniques to efficiently scale training data to find the most valuable samples in massive datasets. To efficiently scale model sizes, we leverage various optimizations such as sparse transducer loss and model sharding. By training 1-10B parameter universal English ASR models, we push the …

Cited by 3 publications (4 citation statements)
References 22 publications
“…Both teams have also attempted to use out-of-domain speaker data to improve ASR performance in various ways [13,14]. Recently, FacebookAI researchers have improved ASR performance in the English AphasiaBank by using neural models trained with vast amounts of data, and then adapted to various domains [8]. Semi-supervised learning was also used recently in an attempt to improve ASR performance in both English and Spanish [11].…”
Section: Related Work
confidence: 99%
“…For that purpose, we investigate extending current crosslingual aphasia detection pipelines by incorporating Automatic Speech Recognition models for two closely related to English low-resource languages, i.e., French and Greek. Since training an ASR domain-specific model for these languages requires a considerable amount of collected speech data that is currently unavailable, we use large pre-trained ASR models that have demonstrated significant capabilities across several domains [8]. Specifically, we leverage pre-trained XLSR-53 architectures [9] based on the Wav2Vec2.0 audio representation [10], that have been proven to work for other low-resource languages on AphasiaBank, i.e., Spanish [11].…”
Section: Introduction
confidence: 99%
“…For pretraining, we use a mixture of data sources, including public videos, audios collected by paid third-party vendors on portal, and synthetic audios generated by text-to-speech. This dataset contains 1.5 Million hours of audios, which are transcribed either through human annotators or a large teacher model (see [33] for details). For adaptation, we use assistant and dictation audios recorded on edge devices, which are collected by external vendors under clean acoustic conditions.…”
Section: Data Setup
confidence: 99%
“…Few-shot and even zero-shot approaches to pathological speech recognition can be successful [1,2,3]. Out of the box, a very large acoustic model with up to 10 billion parameters trained on 4.5 million hours of speech [1] reaches state-of-the-art performance on AphasiaBank [4], a database of aphasic speech. Fine-tuning on this data gives a further 50% relative improvement.…”
Section: Introduction
confidence: 99%