ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9053205

SpecAugment on Large Scale Datasets

Abstract: Recently, SpecAugment, an augmentation scheme for automatic speech recognition that acts directly on the spectrogram of input utterances, has been shown to be highly effective in enhancing the performance of end-to-end networks on public datasets. In this paper, we demonstrate its effectiveness on tasks with large scale datasets by investigating its application to the Google Multidomain Dataset (Narayanan et al., 2018). We achieve improvement across all test domains by mixing raw training data augmented with SpecAu…

Cited by 122 publications (75 citation statements)
References 22 publications
“…Acoustic features are 64-dimensional log-mel filterbanks with a frame shift of 10ms which are stacked and downsampled by a factor of 3. For feature augmentation we employ LibriFullAdapt SpecAugment policy from [21]. We use Adam algorithm [22] for optimization of all models, and the learning rate is scheduled based on warm-up, hold and decay strategy as proposed in [23].…”
Section: Methods (mentioning)
confidence: 99%
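
The warm-up, hold and decay schedule named in the statement above can be sketched as follows. This is only an illustration of the general shape: the statement does not give the details of [23], so the function name `warmup_hold_decay`, the linear warm-up, the exponential decay, and every default value are assumptions.

```python
def warmup_hold_decay(step, peak_lr=1e-3, warmup_steps=1000,
                      hold_steps=4000, decay_steps=10000, final_lr=1e-5):
    """Return a learning rate for a 0-based training step.

    All defaults are hypothetical placeholders, not values from the cited work.
    """
    if warmup_steps < 1 or decay_steps < 1 or hold_steps < 0:
        raise ValueError("warmup_steps and decay_steps must be >= 1, hold_steps >= 0")

    # Phase 1 (assumed linear): ramp from near zero up to peak_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps

    # Phase 2: hold the peak learning rate constant.
    if step < warmup_steps + hold_steps:
        return peak_lr

    # Phase 3 (assumed exponential): decay from peak_lr to final_lr,
    # then stay at final_lr once decay_steps have elapsed.
    progress = min(1.0, (step - warmup_steps - hold_steps) / decay_steps)
    return peak_lr * (final_lr / peak_lr) ** progress
```

Calling `warmup_hold_decay(step)` once per optimizer update gives the rate for that update; an optimizer such as Adam would consume the returned value.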
“…In order to reduce the effective frame rate, features from four adjacent frames are concatenated together (to produce 512 dimensional features), which are further sub-sampled by a factor of 3, so that the effective input frame rate is 30ms. In this work, we also apply SpecAugment masks [9] using the configuration described in [31], which we find to improve performance over the system in [12]. The encoder network in all of our experiments is modeled using a stack of 8 unidirectional LSTM [29] layers, each of which contains 2,048 units and a projection layer of 640 units.…”
Section: Methods (mentioning)
confidence: 99%
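
The frame stacking and sub-sampling described in the statement above can be sketched as follows. This is a minimal illustration, not the cited authors' code: the function name `stack_and_subsample`, the choice to stack each frame with the frames that follow it, and dropping the incomplete tail are assumptions. The 128-dimensional input and 10 ms base frame shift are inferred from the quoted 512-dimensional output and 30 ms effective rate.

```python
def stack_and_subsample(frames, stack=4, stride=3):
    """Concatenate `stack` adjacent frames, then keep every `stride`-th result.

    `frames` is a list of equal-length feature vectors (lists of floats),
    one per base frame. Windows that would run past the end are dropped
    (an assumption; padding is another common choice).
    """
    if stack < 1 or stride < 1:
        raise ValueError("stack and stride must be >= 1")

    stacked = []
    for t in range(len(frames) - stack + 1):
        merged = []
        for frame in frames[t:t + stack]:
            merged.extend(frame)  # concatenate along the feature axis
        stacked.append(merged)

    # Sub-sample in time: keep stacked frames 0, stride, 2*stride, ...
    return stacked[::stride]
```

With 128-dimensional input frames, each output vector has 4 × 128 = 512 values, and with an assumed 10 ms base shift consecutive outputs are 30 ms apart.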
“…We tuned the loss parameter β over three different values (1e-2, 1e-3 and 1e-4), while keeping all other training parameters unchanged and found that best performance is achieved at β = 1e-3. An adaptive SpecAugment [29,30] policy with two frequency masks with mask parameter F = 27, and ten time masks with maximum time-mask ratio p_S = 0.05 has been used to augment the input, which was shared by the teacher and student model. The performance of the trained network is recorded in table 1.…”
Section: Librispeech Experiments (mentioning)
confidence: 99%
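
The masking policy quoted above (two frequency masks with F = 27, ten time masks with maximum time-mask ratio p_S = 0.05) can be sketched as follows. This is a simplified illustration under stated assumptions, not the cited implementation: the function name `spec_augment` is hypothetical, masked cells are set to zero, mask widths and positions are drawn uniformly, and time warping is omitted. The time-mask limit scales with utterance length, which is what makes the policy adaptive.

```python
import random


def spec_augment(spec, num_freq_masks=2, F=27, num_time_masks=10,
                 p_s=0.05, rng=None):
    """Return a masked copy of `spec`, a list of frames (lists of floats).

    Frequency masks are at most F bins wide; each time mask is at most
    floor(p_s * num_frames) frames wide. Masking with 0.0 is an assumption.
    """
    rng = rng or random.Random()
    num_frames = len(spec)
    num_bins = len(spec[0]) if spec else 0
    out = [list(row) for row in spec]  # copy so the input is left untouched

    # Frequency masking: zero a band of consecutive bins in every frame.
    for _ in range(num_freq_masks):
        width = rng.randint(0, min(F, num_bins))
        start = rng.randint(0, num_bins - width)
        for row in out:
            for k in range(start, start + width):
                row[k] = 0.0

    # Time masking: the maximum width adapts to the utterance length.
    max_width = int(p_s * num_frames)
    for _ in range(num_time_masks):
        width = rng.randint(0, max_width)
        start = rng.randint(0, num_frames - width)
        for t in range(start, start + width):
            out[t] = [0.0] * num_bins

    return out
```

For a 200-frame utterance, each time mask covers at most 10 frames, so the ten masks together hide at most half of the utterance; passing a seeded `random.Random` makes the masks reproducible.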
“…Models were trained on a large multi-domain dataset similar to that described in [12], where the domains include Search and FarField. The shared input of the teacher and student models are augmented using SpecAugment [29,30] and multi-style training [31]. The architecture of our uncompressed 0% sparse RNN-T model, also the teacher model for distillation, is similar to that described in [25], and is as follows.…”
Section: Multi-domain Experiments (mentioning)
confidence: 99%