2020
DOI: 10.1109/access.2020.3020421
Acoustic Modeling Based on Deep Learning for Low-Resource Speech Recognition: An Overview

Abstract: The polarization of world languages is becoming more and more obvious. Many languages, mainly endangered ones, are low-resource due to a lack of data, so both language preservation and cultural heritage face serious challenges. Speech recognition for low-resource scenarios has therefore become a hot topic in the speech field. With its complex network structures and huge numbers of model parameters, deep learning has become a powerful tool for speech recognition, which has a br…

Cited by 20 publications (7 citation statements)
References 77 publications
“…In Ref. [59], DL-based methods such as a convolutional recurrent neural network (CRNN), a temporal convolutional network (TCN), a concept-level TCN (CTCN), and a CNN were also applied to both age-group classification and gender recognition across six categories (see Table 5). These combined different DL models [32,35,37,38] into a larger classifier network and achieved a gender identification error of <2% and an age-group classification error of <20% [59]. In the feature extraction layer, the STFT (with a Hamming window) and MFCC methods were used to extract the mel-scale feature patterns.…”
Section: Discussion
confidence: 99%
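As a point of reference, here is a minimal sketch of the STFT-plus-MFCC feature extraction step this statement describes, assuming librosa; the file name, sampling rate, frame length, hop size, and number of coefficients are illustrative assumptions, not values from the cited work.

```python
import numpy as np
import librosa

# Hypothetical input utterance; 16 kHz is an assumed sampling rate.
y, sr = librosa.load("speech.wav", sr=16000)

# Magnitude STFT computed with a Hamming window.
spec = np.abs(librosa.stft(y, n_fft=400, hop_length=160, window="hamming"))

# Mel-scale cepstral coefficients (MFCCs) over the same framing.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

print(spec.shape, mfcc.shape)  # (n_fft // 2 + 1, frames) and (13, frames)
```

Feature matrices of this form are what the CRNN/TCN/CNN classifiers above consume as input.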
“…Herein, for the classification task, we intend to design an ML-based or a DL-based classifier to automatically perform voice classification and gender identification, covering adult males, adult females, and children (boys and girls) [6, 35–37]. To deal with one-dimensional (1D) signals, 1D CNN and two-dimensional (2D) CNN models can be used for digital signal classification in audio and bio-signal recognition [24, 38–42].…”
Section: Introduction
confidence: 99%
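The 1D-CNN route mentioned in this statement can be sketched as follows, assuming PyTorch; the layer sizes and the four output classes (adult males, adult females, boys, girls) are illustrative assumptions rather than the cited architecture.

```python
import torch
import torch.nn as nn

class Conv1DClassifier(nn.Module):
    """Toy 1D CNN mapping a raw waveform to class logits."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time to a fixed-size vector
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                 # x: (batch, 1, samples)
        h = self.features(x).squeeze(-1)  # (batch, 32)
        return self.classifier(h)         # (batch, n_classes) logits

logits = Conv1DClassifier()(torch.randn(2, 1, 16000))  # two 1 s clips at 16 kHz
print(logits.shape)  # torch.Size([2, 4])
```

A 2D CNN would instead take a spectrogram (such as the STFT/MFCC matrices above) as an image-like input.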
“…Orken Zh. et al. proposed a joint model based on CTC and the attention mechanism for recognizing Kazakh speech in noisy conditions [54]. Beyond improvements to model structure, several key technologies are often applied to low-resource speech recognition and are central to improving performance [55]. The most widespread of these is data augmentation, a technique for increasing the amount of data available for training speech recognition systems.…”
Section: Related Work
confidence: 99%
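To make the data-augmentation point concrete, here is a hedged sketch of two common waveform augmentations (time stretching and additive noise), assuming librosa and NumPy; the stretch factor, SNR, and file name are illustrative, not taken from the cited systems.

```python
import numpy as np
import librosa

def augment(y, rate=1.1, snr_db=20.0):
    """Return a time-stretched copy of y with Gaussian noise added at snr_db."""
    y_stretched = librosa.effects.time_stretch(y, rate=rate)
    signal_power = np.mean(y_stretched ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.randn(len(y_stretched)) * np.sqrt(noise_power)
    return y_stretched + noise

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical training file
y_aug = augment(y)  # one extra training example derived from one original
```

Varying the rate and SNR across copies multiplies the effective size of a small training corpus.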
“…Several voice assistants can currently recognize human speech through interactive real-time dialogue and act automatically on the recognized content, such as Google’s Assistant and Apple’s Siri, which can converse in over 40 and 35 languages, respectively [11]. The majority of popular ASR systems use Gaussian Mixture Models (GMMs), Hidden Markov Models (HMMs), and Deep Neural Networks (DNNs) [12, 13, 14, 15]. DNNs play an essential part in the building of ASR systems [16, 17], mostly because of the evolution of new neural network models, as well as training and classification techniques [18, 19].…”
Section: Introduction
confidence: 99%
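For readers unfamiliar with the hybrid pipeline this statement alludes to, the sketch below shows the role a DNN typically plays there: mapping one acoustic feature frame to posteriors over HMM states. It assumes PyTorch, and the feature dimension and state count are hypothetical.

```python
import torch
import torch.nn as nn

N_FEATS, N_STATES = 39, 120  # e.g. 13 MFCCs + deltas; assumed state inventory

# Feed-forward acoustic model: one logit per HMM state for each input frame.
acoustic_model = nn.Sequential(
    nn.Linear(N_FEATS, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, N_STATES),
)

frame = torch.randn(1, N_FEATS)
posteriors = torch.softmax(acoustic_model(frame), dim=-1)  # P(state | frame)
```

In a full hybrid GMM/DNN-HMM recognizer these per-frame posteriors are combined with HMM transition probabilities and a language model during decoding.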