Proceedings of the 2020 5th International Conference on Multimedia Systems and Signal Processing 2020
DOI: 10.1145/3404716.3404717

Speech Recognition using EfficientNet

Cited by 13 publications (5 citation statements)
References 18 publications
“…We investigate 14 modern DNN architectures: seven variations of EfficientNetV2 [9] and seven variations of EfficientNetV1 [8]. All these architectures were originally designed for image classification and have since been applied to various other problems such as speech recognition [26]. The accuracy of these architectures on the ImageNet dataset [27] ranges from 77.1% to 85.7%.…”
Section: Results
Citation type: mentioning
confidence: 99%
“…[21] distorted the vocal tract length (VTLP) managing to improve ASR models by 2.5% TER (Token Error Rate). [22] made a substantial improvement in this technique by distorting noise addition, velocity adjustment and pitch shifting in the original audios, managing to reduce WER by 5.1%. DA techniques using a TTS model are also delivering good results.…”
Section: Data Augmentation
Citation type: mentioning
confidence: 99%
“…As a second baseline, the method of DA through audio distortion proposed by [22] was developed to augment the speech data of the original corpus. The nlpaug library was used to manipulate the training audios (99 hours) by modifying the speed according to a randomly selected coefficient in the range between 0.85 and 1.15, which is where this DA technique performs best.…”
Section: Distortion Baseline
Citation type: mentioning
confidence: 99%
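
The speed modification described in the statement above can be illustrated with a minimal sketch. This is not the nlpaug implementation and not the cited authors' code: it is a plain NumPy stand-in that resamples the waveform by linear interpolation, which changes pitch along with duration. The function names `perturb_speed` and `random_speed_perturb` are hypothetical; only the 0.85 to 1.15 coefficient range comes from the quoted text.

```python
import numpy as np


def perturb_speed(samples, factor):
    """Resample a 1-D waveform so it plays `factor` times faster.

    Assumption: plain linear-interpolation resampling, used here only as
    an illustration. It shifts pitch as well as duration, unlike a
    phase-vocoder time stretch.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = max(1, int(round(len(samples) / factor)))
    # Positions in the original signal at which the output is sampled.
    positions = np.linspace(0, len(samples) - 1, num=n_out)
    return np.interp(positions, np.arange(len(samples)), samples)


def random_speed_perturb(samples, rng, low=0.85, high=1.15):
    """Draw a speed coefficient uniformly from [low, high) and apply it."""
    factor = rng.uniform(low, high)
    return perturb_speed(samples, factor), factor


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    one_second = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
    augmented, factor = random_speed_perturb(one_second, rng)
    print(f"factor={factor:.3f}, samples: {len(one_second)} -> {len(augmented)}")
```

A coefficient above 1 shortens the clip and one below 1 lengthens it, so each training utterance yields a slightly different duration on every pass.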
“…This model could be widely applicable for transfer learning in sound classification. Lu et al used the morphology of spectrograms as the input pattern to recognize speech using an EfficientNet model [10]. Padi et al employed transfer learning to improve the accuracy of speech emotion recognition through spectrogram augmentation [7].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
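
The idea of treating a spectrogram as an image-like input, mentioned in the statement above, can be sketched as follows. The quoted text does not give the cited paper's preprocessing, so everything here is an assumption made for illustration: the function name `log_spectrogram`, the 400-sample Hann window, the 160-sample hop, and the log compression.

```python
import numpy as np


def log_spectrogram(samples, frame_len=400, hop=160):
    """Return a log-magnitude spectrogram of shape (freq_bins, frames).

    Assumption: frame and hop sizes are illustrative defaults
    (25 ms / 10 ms at 16 kHz), not values taken from the cited paper.
    """
    if len(samples) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(samples) - frame_len) // hop
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack(
        [samples[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # log1p keeps values non-negative and compresses the dynamic range.
    return np.log1p(magnitude).T


if __name__ == "__main__":
    tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
    spec = log_spectrogram(tone)
    print(spec.shape)
```

The resulting 2-D array could then be resized or stacked into channels to match whatever input shape an image classifier expects.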