FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech

Conneau, Alexis; Ma, Min; Khanuja, Simran; Dalmia, Siddharth; Riesa, Jason; Rivera, Clara; Bapna, Ankur

doi:10.48550/arxiv.2205.12446

Cited by 3 publications

(8 citation statements)

References 27 publications


“…• Automatic Speech Recognition (ASR): We use YouTube data to train USMs for YouTube (e.g., closed captions). We evaluate the USMs on two public benchmarks, SpeechStew [2] and FLEURS [16]. We also report results on the long-form test set CORAAL [17] for which only the evaluation set is available.…”

Section: Supervised ASR Training [mentioning]

confidence: 99%

“…SoTA results for downstream multilingual speech tasks: Our USM models achieve state-of-the-art performance for multilingual ASR and AST for multiple datasets in multiple domains. This includes SpeechStew (mono-lingual ASR) [2], CORAAL (African American Vernacular English (AAVE) ASR) [17], FLEURS (multi-lingual ASR) [16], YT (multilingual long-form ASR), and CoVoST (AST from English to multiple languages). We depict our model's performance in the first panel of Fig.…”

Section: Key Findings [mentioning]

confidence: 99%

“…We present our results on two public tasks, SpeechStew [2] and FLEURS [16], and an internal benchmark on YouTube.…”

Section: Speech Recognition (ASR) [mentioning]

confidence: 99%

“…The FLEURS [16] dataset is a publicly available, multi-way parallel dataset of 10 hours of read speech in 102 languages spanning 7 geo-groups. We restrict our use of the dataset to its ASR benchmark.…”

Section: Speech Recognition (ASR) [mentioning]

confidence: 99%

“…We also report full results for in-domain fine-tuning and adaptation. Unlike [16], we report both WER and CER metrics, as CER is inappropriate as an indicator of Table 3: WERs (%) across multiple tasks for multiple settings compared against pre-existing baselines, with the exception of CoVoST 2, for which the BLEU score is presented. For the YouTube long-form set, we select the top-25 languages Whisper was trained on and exclude all languages for which Whisper produces > 40% WER to reduce the noise introduced by LAS hallucination in the Whisper model.…”

Section: Speech Recognition (ASR) [mentioning]

confidence: 99%


BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Park, Han, Qin, et al. 2022. IEEE J. Sel. Top. Signal Process.


We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set 1/7-th the size of that used for the Whisper model [1], our model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.


Universal Adversarial Attacks on Spoken Language Assessment Systems

Vyas, Gales, Knill. 2020. Interspeech 2020.


There is an increasing demand for automated spoken language assessment (SLA) systems, partly driven by the performance improvements that have come from deep learning based approaches. One aspect of deep learning systems is that they do not require expert derived features, operating directly on the original signal such as a speech recognition (ASR) transcript. This, however, increases their potential susceptibility to adversarial attacks as a form of candidate malpractice. In this paper the sensitivity of SLA systems to a universal black-box attack on the ASR text output is explored. The aim is to obtain a single, universal phrase to maximally increase any candidate's score. Four approaches to detect such adversarial attacks are also described. All the systems, and associated detection approaches, are evaluated on a free (spontaneous) speaking section from a Business English test. It is shown that on deep learning based SLA systems the average candidate score can be increased by almost one grade level using a single six word phrase appended to the end of the response hypothesis. Although these large gains can be obtained, they can be easily detected based on detection shifts from the scores of a "traditional" Gaussian Process based grader.


Untitled

2024. IJCI.


Speech-to-speech translation is yet to reach the same level of coverage as text-to-text translation systems. The current speech technology is highly limited in its coverage of over 7000 languages spoken worldwide, leaving more than half of the population deprived of such technology and shared experiences. With voice-assisted technology (such as social robots and speech-to-text apps) and auditory content (such as podcasts and lectures) on the rise, ensuring that the technology is available for all is more important than ever. Speech translation can play a vital role in mitigating technological disparity and creating a more inclusive society. With a motive to contribute towards speech translation research for low-resource languages, our work presents a direct speech-to-speech translation model for one of the Indic languages called Punjabi to English. Additionally, we explore the performance of using a discrete representation of speech called discrete acoustic units as input to the Transformer-based translation model. The model, abbreviated as Unit-to-Unit Translation (U2UT), takes a sequence of discrete units of the source language (the language being translated from) and outputs a sequence of discrete units of the target language (the language being translated to). Our results show that the U2UT model performs better than the Speech-to-Unit Translation (S2UT) model by a 3.69 BLEU score.
