DNN based continuous speech recognition system of Punjabi language on Kaldi toolkit

Guglani, Jyoti; Mishra, Achyuta Nand

doi:10.1007/s10772-020-09717-8

Cited by 18 publications

(11 citation statements)

References 20 publications

“…Increasing the number of hidden layers, improve the performance of the network, however, it impose serious issue such as computation cost, network complexity and model overfitting. It has been shown that the deep algorithms employed successfully in a number of fields such as image recognition [55]-[57], speech recognition [58], [59], natural language processing [60], [61] and bioinformatics [62]-[64]. Additionally, it has been presented by several researchers that the DNN demonstrated superior performance over the traditional learning approaches employed for a various complex problems [53], [65].…”

Section: Heterogeneous Feature Vector (mentioning)

confidence: 99%

iEnhancer-DHF: Identification of Enhancers and Their Strengths Using Optimize Deep Neural Network With Multiple Features Extraction Methods

et al. 2021

Enhancers are short DNA regulatory elements which play a vital role in gene expression. Due to their important roles in genomics, several computational models have been proposed in the literature for identification of enhancers and their strengths using traditional machine learning algorithms. However, the proposed models are unable to identify enhancers and their strength with reasonable accuracy because of high non-linearity in DNA sequences. This paper proposes a two-level intelligent model based on a Deep Neural Network (DNN) along with multiple feature extraction methods. Firstly, the proposed model represents the given DNA sequences as feature vectors using Pseudo K-tuple Nucleotide Composition (PseKNC) and FastText methods. Secondly, the feature vectors are fused to make a heterogeneous feature vector that considers the local and global correlation amongst the given sequences along with internal structure information. Finally, the heterogeneous feature vector is given to a DNN model to make final predictions. The proposed iEnhancer-DHF is developed using a two-layer approach. The first layer predicts whether the given DNA samples are enhancers or non-enhancers, whereas the second layer identifies whether the enhancers are strong or weak. The outcome of the proposed model was rigorously assessed using both training and independent datasets via the 10-fold cross validation method. The validation outcome demonstrated that the iEnhancer-DHF model yielded accuracies of 86.07% and 69.60% at the first layer and second layer respectively on the training dataset. Similarly, the model yielded accuracies of 83.21% and 67.54% at the first layer and second layer respectively on the independent dataset. Additionally, the outcomes of the proposed model were initially compared with widely applied classifiers such as Support Vector Machine, Random Forest and K-nearest Neighbor, and subsequently the performance was compared with the existing models using both the training and independent datasets. The comparison results showed that the iEnhancer-DHF model outperformed the recently published models.

“…The audio module has no built in analysis, nor classification capabilities, as this is deferred to the text processing module. Kaldi [15] is ideal for this transcription task, as it can be integrated at the operating system level, making these audio signals fully available to our solution. Capturing audio signals is independent from screen capturing, which is why the audio and image analysis tasks are naturally divided.…”

Section: A. The Audio Module (mentioning)

confidence: 99%

“…Kaldi is appropriate for the child protection context mainly because it is flexible in controlling all parts of the speech-to-text conversion and could easily adapt to different noisy environments by integrating different acoustic modelling scripts FIGURE 3. The design of the transcription system [15] at the operating system level. This approach is also flexible, meaning that it could be used locally, in an edge-computing or cloud-computing architecture.…”

Section: A. The Audio Module (mentioning)

confidence: 99%

Keeping Children Safe Online With Limited Resources: Analyzing What is Seen and Heard

et al. 2021

It is every parent's wish to protect their children from online pornography, cyber bullying and cyber predators. Several existing approaches analyze a limited amount of information stemming from the interactions of the child with the corresponding online party. Some restrict access to websites based on a blacklist of known forbidden URLs, others attempt to parse and analyze the exchanged multimedia content between the two parties. However, new URLs can be used to circumvent a blacklist, and images, video, and text can individually appear to be safe, but need to be judged jointly. We propose a highly modular framework of analyzing content in its final form at the user interface, or Human Computer Interaction (HCI) layer, as it appears before the child: on the screen and through the speakers. Our approach is to produce Children's Agents for Secure and Privacy Enhanced Reaction (CASPER), which analyzes screen captures and audio signals in real time in order to make a decision based on all of the information at its disposal, with limited hardware capabilities. We employ a collection of deep learning techniques for image, audio and text processing in order to categorize visual content as pornographic or neutral, and textual content as cyberbullying or neutral. We additionally contribute a custom dataset that offers a wide spectrum of objectionable content for evaluation and training purposes. CASPER demonstrates an average accuracy of 88% and an F1 score of 0.85 when classifying text, and an accuracy of 95% when classifying pornography. INDEX TERMS: Cyber-bullying, Cyber-grooming, Online Safety, Pornography filter, Real time agent.

“…In addition, several techniques were proposed by the researchers to improve the acoustic variabilities. Different front-end feature extraction techniques perceptual linear prediction (PLP), spectrum-based feature extraction, and Mel-Frequency cepstral coefficients (MFCC) have been used to extract the acoustic features [1,9,12-14]. Researchers have also made minor changes in the feature extraction process implemented in the front-end, these pitch features are also used for improving the speech recognition rate [10,13,15-17].…”

Section: Related Work (mentioning)

confidence: 99%

Usage of Prosody Modification and Acoustic Adaptation for Robust Automatic Speech Recognition (ASR) System

Bhardwaj, Kukreja, Singh 2021

RIA

Most automatic speech recognition (ASR) systems are trained using adult speech due to the limited availability of children's speech datasets. The speech recognition rate of such systems is very low when tested on children's speech, due to the inter-speaker acoustic variabilities between adults' and children's speech. These inter-speaker acoustic variabilities are mainly because of the higher pitch and lower speaking rate of children. Thus, the main objective of the research work is to increase the speech recognition rate of the Punjabi-ASR system by reducing these inter-speaker acoustic variabilities with the help of prosody modification and speaker adaptive training. Prosody modification can alter the pitch period and duration (speaking rate) of the speech signal without influencing the naturalness or message of the signal, which helps to overcome the acoustic variations between adults' and children's speech. The developed Punjabi-ASR system is trained with the help of adult speech and prosody-modified adult speech. This prosody-modified speech overcomes the massive need for children's speech for training the ASR system and improves the recognition rate. Results show that prosody modification and speaker adaptive training help to minimize the word error rate (WER) of the Punjabi-ASR system to 8.79% when tested using children's speech.
