Acoustic Modeling Based on Deep Learning for Low-Resource Speech Recognition: An Overview

Yu, Chongchong; Kang, Mengzhen; Chen, Yunbing; Wu, Jiajia; Zhao, Xia

doi:10.1109/access.2020.3020421

Cited by 20 publications

(7 citation statements)

References 77 publications


“…In Ref. [59], DL-based methods, such as a convolutional recurrent neural network (CRNN), temporal convolutional network (TCN), concept-level TCN (CTCN), and CNN, were also applied in both group classification and gender recognition for six categories (as seen in Table 5), which combined different DL models [32,35,37,38] to establish a larger size of the classifier network and offered a gender identification error of <2% and an age group classification error of <20% [59]. In the feature extraction layer, the STFT (with the Hamming window) and MFCC methods were used to extract the mel-scale feature patterns.…”

Section: Discussion (mentioning)

confidence: 99%

“…Herein, in classification task, we intend to design a ML‐based or a DL‐based classifier to automatically perform the voice classification and gender identification, including adult males, adult females, and children (boys and girls) [6, 35–37]. To deal with the one‐dimensional (1D) signals, 1D CNN and two‐dimensional (2D) CNN models can be used for digital signal classification in audio and bio‐signal recognition [24, 38–42].…”

Section: Introduction (mentioning)

confidence: 99%


Vowel classification with combining pitch detection and one‐dimensional convolutional neural network based classifier for gender identification

Lin, Lai, Huang, et al. 2023

IET Signal Processing

Human speech signals may contain specific information regarding a speaker's characteristics, and these signals can be very useful in applications involving interactive voice response (IVR) and automatic speech recognition (ASR). For IVR and ASR applications, speaker classification into different ages and gender groups can be applied in human-machine interaction or computer-based interaction systems for customised advertisement, translation (text generation), machine dialog systems, or self-service applications. Hence, an IVR-based system dictates that ASR should function through users' voices (specific voice-frequency bands) to identify customers' age and gender and interact with a host system. In the present study, we intended to combine a pitch detection (PD)-based extractor and a voice classifier for gender identification. The Yet Another Algorithm for Pitch Tracking (YAAPT)-based PD method was designed to extract the voice fundamental frequency (F0) from non-stationary speaker's voice signals, allowing us to achieve gender identification, by distinguishing differences in F0 between adult females and males, and classify voices into adult and children groups. Then, in vowel voice signal classification, a one-dimensional (1D) convolutional neural network (CNN) was used, consisting of a multi-round 1D kernel convolutional layer, a 1D pooling process, and a vowel classifier that could preliminarily divide feature patterns into three level ranges of F0, including adult and children groups. Consequently, a classifier was used in the classification layer to identify the speakers' gender. The proposed PD-based extractor and voice classifier could reduce complexity and improve classification efficiency. Acoustic datasets were selected from the Hillenbrand database for experimental tests on 12 vowels classifications, and K-fold cross-validations were performed. The experimental results demonstrated that our


“…Orken Zh et al. proposed a joint model based on CTC and the attention mechanism for recognition of Kazakh speech in noisy conditions [54]. In addition to the improvement of the model structure, some important technologies are often applied to low-resource speech recognition, which is also the key to improving performance [55]. The most widespread application for these is data augmentation, a technology for increasing the amount of data needed for training speech recognition systems.…”

Section: Related Work (mentioning)

confidence: 99%

Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition

Ren, Yolwas, Slamu, et al. 2022

Sensors

Unlike the traditional model, the end-to-end (E2E) ASR model does not require speech information such as a pronunciation dictionary, and its system is built through a single neural network and obtains performance comparable to that of traditional methods. However, the model requires massive amounts of training data. Recently, hybrid CTC/attention ASR systems have become more popular and have achieved good performance even under low-resource conditions, but they are rarely used in Central Asian languages such as Turkish and Uzbek. We extend the dataset by adding noise to the original audio and using speed perturbation. To develop the performance of an E2E agglutinative language speech recognition system, we propose a new feature extractor, MSPC, which uses different sizes of convolution kernels to extract and fuse features of different scales. The experimental results show that this structure is superior to VGGnet. In addition to this, the attention module is improved. By using the CTC objective function in training and the BERT model to initialize the language model in the decoding stage, the proposed method accelerates the convergence of the model and improves the accuracy of speech recognition. Compared with the baseline model, the character error rate (CER) and word error rate (WER) on the LibriSpeech test-other dataset increases by 2.42% and 2.96%, respectively. We apply the model structure to the Common Voice—Turkish (35 h) and Uzbek (78 h) datasets, and the WER is reduced by 7.07% and 7.08%, respectively. The results show that our method is close to the advanced E2E systems.
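The abstract above says the dataset was extended by adding noise to the original audio and by using speed perturbation. The following is a minimal sketch of those two augmentations, not the authors' implementation: it assumes white Gaussian noise, a target signal-to-noise ratio in dB, and linear-interpolation resampling. The function names `add_noise` and `speed_perturb` and all parameter values are hypothetical.

```python
import math
import random


def add_noise(samples, snr_db, seed=0):
    """Add white Gaussian noise at roughly the requested SNR in dB (assumed noise type)."""
    rng = random.Random(seed)
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def speed_perturb(samples, factor):
    """Resample by linear interpolation; factor > 1 speeds up (shortens) the signal."""
    n_out = max(1, int(round(len(samples) / factor)))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * factor                # fractional read position in the input
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


# Usage: a one-second 440 Hz tone at a hypothetical 16 kHz sampling rate.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_noise(tone, snr_db=10)
faster = speed_perturb(tone, 1.1)   # 14545 samples
slower = speed_perturb(tone, 0.9)   # 17778 samples
```

Speed factors such as 0.9 and 1.1 are common choices in ASR recipes, but the abstract does not state which values were used.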


“…Several voice assistants can currently recognize human speech patterns through an interactive real-time smart dialogue and apply automatic techniques based on the recognized content, such as Google’s Assistant and Apple’s Siri, which can converse in over 40 and 35 languages, respectively [11]. The majority of popular ASR systems use Gaussian Mixture Models (GMMs), Hidden Markov Models (HMMs), and Deep Neural Networks (DNNs) [12, 13, 14, 15]. DNNs play an essential part in the building of ASR systems [16, 17], mostly because of the evolution of unique neural network models, as well as training and classification techniques [18, 19].…”

Section: Introduction (mentioning)

confidence: 99%

Automatic Speech Recognition Method Based on Deep Learning Approaches for Uzbek Language

Mukhamadiyev, Khujayorov, Djuraev, et al. 2022

Sensors

Communication has been an important aspect of human life, civilization, and globalization for thousands of years. Biometric analysis, education, security, healthcare, and smart cities are only a few examples of speech recognition applications. Most studies have mainly concentrated on English, Spanish, Japanese, or Chinese, disregarding other low-resource languages, such as Uzbek, leaving their analysis open. In this paper, we propose an End-To-End Deep Neural Network-Hidden Markov Model speech recognition model and a hybrid Connectionist Temporal Classification (CTC)-attention network for the Uzbek language and its dialects. The proposed approach reduces training time and improves speech recognition accuracy by effectively using CTC objective function in attention model training. We evaluated the linguistic and lay-native speaker performances on the Uzbek language dataset, which was collected as a part of this study. Experimental results show that the proposed model achieved a word error rate of 14.3% using 207 h of recordings as an Uzbek language training dataset.
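The abstracts above report results as word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch using the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the example sentences are hypothetical, and the citing papers may apply text normalization that this sketch omits.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


# Usage: one substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Character error rate (CER) follows the same recipe with characters in place of words.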
