ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9054130

Improving Sequence-To-Sequence Speech Recognition Training with On-The-Fly Data Augmentation

Abstract: Sequence-to-Sequence (S2S) models have recently started to show state-of-the-art performance for automatic speech recognition (ASR). With these large and deep models, overfitting remains the largest problem, outweighing performance improvements that can be obtained from better architectures. One solution to the overfitting problem is increasing the amount of available training data and the variety exhibited by the training data with the help of data augmentation. In this paper we examine the influence of three data …

Cited by 85 publications (60 citation statements)
References 16 publications
“…FBK (Gaido et al, 2020) participated with an end-to-end system adapting the S-Transformer model (Di Gangi et al, 2019b,c). Its training is based on: i) transfer learning (via ASR pretraining and word/sequence knowledge distillation), ii) data augmentation (with SpecAugment (Park et al, 2019), time stretch (Nguyen et al, 2020a) and synthetically-created data), iii) combining synthetic and real data marked as different "domains" as in (Di Gangi et al, 2019d), and iv) multitask learning using the CTC loss (Graves et al, 2006). Once the training with word-level knowledge distillation is complete, the model is fine-tuned using label-smoothed cross entropy (Szegedy et al, 2016).…”
Section: Submissions (mentioning)
confidence: 99%
“…(1) ASR (both LSTM (Nguyen et al, 2020b) and Transformer-based (Pham et al, 2019a)), (2) Segmentation (with a monolingual NMT system (Sperber et al, 2018) that adds sentence boundaries and case, also inserting proper punctuation), and (3) MT (a Transformer-based encoder-decoder model implementing Relative Attention following (Dai et al, 2019), adapted via fine-tuning on data incorporating artificially-injected noise). The WebRTCVAD toolkit is used to process the unsegmented test set.…”
Section: Submissions (mentioning)
confidence: 99%
“…Model: We only focus on sequence-to-sequence ASR models, which are based on two different network architectures: the long short-term memory (LSTM) and the Transformer. Our LSTM-based models consist of 6 bidirectional layers of 1024 units for the encoder and 2 unidirectional layers for the decoder (Nguyen et al, 2019). Our Transformer-based models presented in (Pham et al, 2019b) consist of 32 blocks for the encoder and 12 blocks for the decoder.…”
Section: Speech Recognition (mentioning)
confidence: 99%
“…To this aim, we rely on data augmentation and knowledge transfer techniques that were shown to yield competitive models at the IWSLT-2020 evaluation campaign (Ansari et al, 2020; Potapczyk and Przybysz, 2020; Gaido et al, 2020). In particular, we use three data augmentation methods, namely SpecAugment (Park et al, 2019), time stretch (Nguyen et al, 2020), and synthetic data generation (Jia et al, 2019), and we transfer knowledge both from ASR and MT through component initialization and knowledge distillation (Hinton et al, 2015).…”
Section: Base ST Model (mentioning)
confidence: 99%