ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9054382

Small Energy Masking for Improved Neural Network Training for End-To-End Speech Recognition

Abstract: In this paper, we present a Small Energy Masking (SEM) algorithm, which masks inputs having values below a certain threshold. More specifically, a time-frequency bin is masked if the filterbank energy in this bin is less than a certain energy threshold. A uniform distribution is employed to randomly generate the ratio of this energy threshold to the peak filterbank energy of each utterance in decibels. The unmasked feature elements are scaled so that the total sum of the feature values remains the same through …
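The masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the dB range of the uniform distribution (`min_db`, `max_db`), and the choice to set masked bins to zero are all assumptions made for illustration, since the abstract does not specify them.

```python
import random


def small_energy_masking(energy, rng, min_db=-80.0, max_db=0.0):
    """Hypothetical sketch of SEM on a list of frames of filterbank energies.

    energy: list of rows (frames), each a list of non-negative energies.
    rng: a random.Random instance.
    min_db, max_db: assumed bounds of the uniform distribution (not from the paper).
    """
    # Peak filterbank energy over the whole utterance.
    peak = max(max(row) for row in energy)

    # Draw the threshold-to-peak ratio in decibels from a uniform distribution.
    ratio_db = rng.uniform(min_db, max_db)
    threshold = peak * 10.0 ** (ratio_db / 10.0)

    # Mask (here: zero out, an assumption) every bin below the threshold.
    masked = [[e if e >= threshold else 0.0 for e in row] for row in energy]

    # Rescale the surviving bins so the total sum of the features is unchanged.
    total = sum(sum(row) for row in energy)
    kept = sum(sum(row) for row in masked)
    scale = total / kept if kept > 0 else 1.0
    return [[e * scale for e in row] for row in masked], threshold


if __name__ == "__main__":
    utterance = [[1.0, 0.001], [0.5, 1e-9]]
    out, thr = small_energy_masking(utterance, random.Random(0))
    print(thr, out)
```

Because the ratio is at most 0 dB in this sketch, the peak bin always survives, so the rescaling never divides by zero for a non-silent utterance.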

Citations: cited by 8 publications (4 citation statements)
References: 17 publications
“…This can be done with special data augmentation. For instance in speech recognition, it can be done by randomly removing low-energy parts of the recording [45]. In our case, it could be done by randomly inserting bias into images during the training, similar to online data augmentation.…”
Section: Discussion (mentioning)
confidence: 99%
“…Further, 7th root compression is applied instead of logarithmic, as in Gammatone or PLP features [25,26]. The alternative feature extraction pipeline is also extended with small-energy masking (SEM) perturbation [27]. Relative to the peak energy in a given utterance, the method masks time-frequency bins with small energy in the Mel-spectral domain.…”
Section: Methods (mentioning)
confidence: 99%
“…Room impulse response simulation and adding point-source noises were proposed for far-field ASR [25]. Inspired by input dropout, [26] proposed to improve the noise robustness of CNN acoustic models by discarding input features, and [27] proposed to mask time-frequency bins with energy lower than randomised thresholds. Recently, SpecAugment [28] (SA) was proposed to augment speech data by warping spectrograms along the time axis, and masking time and/or frequency bands in the spectral domain.…”
Section: Related Work (mentioning)
confidence: 99%