2021
DOI: 10.48550/arxiv.2111.09296
Preprint

XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

Abstract: This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, both high and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art b…
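The abstract describes wav2vec 2.0-style pretrained encoders released at up to 2B parameters. As a minimal, non-authoritative sketch, assuming the publicly released facebook/wav2vec2-xls-r-300m checkpoint and the Hugging Face transformers API (the larger 1B/2B variants use the same interface), cross-lingual speech representations can be extracted from the frozen encoder like this:

```python
# Minimal sketch: extract cross-lingual speech representations with a pretrained
# XLS-R encoder. Assumes the public "facebook/wav2vec2-xls-r-300m" checkpoint and
# the Hugging Face transformers API.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

model_name = "facebook/wav2vec2-xls-r-300m"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name).eval()

# 16 kHz mono waveform; a 1-second silent signal stands in for real audio here.
waveform = torch.zeros(16000)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# (batch, frames, hidden_size) contextual representations for downstream tasks.
print(outputs.last_hidden_state.shape)
```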

Cited by 51 publications (121 citation statements)
References 30 publications

“…We use XLS-R-U as pretraining data and fine-tune it on MLS-10hrs. As shown in Table 3, our baseline w2v-BERT already outperforms the previous strong XLS-R (2B) model (Babu et al., 2021). The average WER is brought down by a further 3% relative by using the proposed BEST-RQ.…”
Section: Results on MLS-10hrs (mentioning)
confidence: 72%
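For clarity, the "3% relative" figure quoted above is a relative, not absolute, reduction in word error rate; a tiny illustration with made-up numbers:

```python
# Relative WER reduction, with hypothetical numbers (not taken from the cited work).
baseline_wer = 10.0  # average WER of the baseline, in %
improved_wer = 9.7   # average WER of the improved model, in %

relative_reduction = (baseline_wer - improved_wer) / baseline_wer
print(f"{relative_reduction:.1%} relative WER reduction")  # -> 3.0%
```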
“…XLS-R unsupervised data (XLS-R-U): our public unlabeled speech data follows the pre-training data used for XLS-R (Babu et al., 2021), with one major difference: we do not use any data from VoxLingua-107 due to license constraints. In total, we utilize approximately 429k hours of unlabeled speech data in 51 languages.…”
Section: Data (mentioning)
confidence: 99%
“…On the other hand, the first layers are better, but still not optimal, probably due to the low contextual information. We find that the middle layers (11-20) have the most informative representations that can be used by the classifier. More specifically, the output of the 14th layer achieves the best segmentation, retaining almost 95% of the manual BLEU score.…”
Section: Appendix A (mentioning)
confidence: 99%
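The excerpt above selects a middle transformer layer (the 14th) as the most informative representation. A hedged sketch of probing per-layer hidden states, again assuming the Hugging Face transformers API and the facebook/wav2vec2-xls-r-300m checkpoint; the layer index 14 follows the excerpt:

```python
# Minimal sketch: inspect per-layer XLS-R representations, as in the layer-wise
# analysis quoted above. Checkpoint name and API usage are assumptions.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

model_name = "facebook/wav2vec2-xls-r-300m"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name).eval()

waveform = torch.zeros(16000)  # placeholder for a real 16 kHz utterance
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the feature projection; hidden_states[i] is transformer layer i.
layer_14 = outputs.hidden_states[14]
print(layer_14.shape)  # (batch, frames, hidden_size)
```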
“…After self-supervision, it can be fine-tuned for downstream tasks such as ASR. Its multilingual version, XLS-R [13], has been pre-trained on 128 languages using 436k hours of speech data. Audio segmentation methods.…”
Section: Introduction (mentioning)
confidence: 99%
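The excerpt notes that, after self-supervised pre-training, the encoder is fine-tuned for downstream ASR. A hedged sketch of attaching a CTC head, assuming the transformers Wav2Vec2ForCTC class; the checkpoint name, 32-token character vocabulary, and dummy data are illustrative assumptions, not values from the paper:

```python
# Minimal sketch: attach a CTC head to a pretrained XLS-R encoder for ASR fine-tuning.
# Checkpoint name, vocabulary size, and dummy data are illustrative assumptions.
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    vocab_size=32,               # hypothetical character vocabulary
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()   # common practice: keep the CNN feature encoder frozen

# One dummy 16 kHz utterance and a dummy label sequence; real fine-tuning would
# iterate over a labeled corpus such as the MLS 10-hour split mentioned above.
input_values = torch.zeros(1, 16000)
labels = torch.tensor([[5, 8, 2, 9]])

outputs = model(input_values, labels=labels)
print(outputs.loss)  # CTC loss to backpropagate during fine-tuning
```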