Interspeech 2020 2020
DOI: 10.21437/interspeech.2020-2164
|View full text |Cite
|
Sign up to set email alerts
|

Large-Scale End-to-End Multilingual Speech Recognition and Language Identification with Multi-Task Learning

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
3
1
1

Citation Types

0
49
0
1

Year Published

2021
2021
2024
2024

Publication Types

Select...
4
2
2

Relationship

1
7

Authors

Journals

citations
Cited by 57 publications
(50 citation statements)
references
References 0 publications
0
49
0
1
Order By: Relevance
“…A spectrogram is one of the most used visual input representations of speech signals in speech analysis tasks, such as ASR [ 36 ] and SER [ 24 ] using deep learning (DL) models. It demonstrates the signal strength over time at different frequencies present in a particular waveform.…”
Section: Proposed Age and Gender Classification Methodologymentioning
confidence: 99%
“…A spectrogram is one of the most used visual input representations of speech signals in speech analysis tasks, such as ASR [ 36 ] and SER [ 24 ] using deep learning (DL) models. It demonstrates the signal strength over time at different frequencies present in a particular waveform.…”
Section: Proposed Age and Gender Classification Methodologymentioning
confidence: 99%
“…For every target language, a subword vocabulary of size 100 is generated using the SentencePiece [19] toolkit. We employ the aforementioned subword-based LID-42 model presented in [7] as the pre-trained multilingual ASR model, which consists of 12 encoder layers and 6 decoder layers with a model dimension of 256. The number of multihead attention heads is 4 and the inner-dimension of the feedforward network is 2048.…”
Section: Implementation Detailsmentioning
confidence: 99%
“…Pretap et al [6] introduced a massive single E2E model with up to 1 billion parameters trained on 50 languages. Nearly at the same time, Hou et al [7] reported a super language-independent Transformerbased ASR model (LID-42) jointly trained on 6 million training utterances from 42 languages with hybrid CTC-attention multi-task learning [8]. Both of them achieved a significant recognition accuracy improvement on low-resource ASR via transfer learning.…”
Section: Introductionmentioning
confidence: 99%
See 1 more Smart Citation
“…With the exception of a few recent works [5,6,7], most previous work on multilingual speech recognition focuses on the benefits of these models for lower-resource or related languages. Nevertheless, in order for these models to be utilized in real-world scenarios and replace their monolingual counterparts, they need to target a variety of languages, with large • Introduction of an informed mixture-of-experts layer, used in the encoder of an RNN-T model, where each expert is assigned to one language, or set of related languages.…”
Section: Introductionmentioning
confidence: 99%