Dysarthria is a motor speech disorder caused by damage to the brain regions that control the motor aspects of speech, giving rise to effortful, slow, slurred, or prosodically abnormal speech. Traditional Automatic Speech Recognition (ASR) systems perform poorly on dysarthric speech, owing mostly to insufficient dysarthric speech data; speaker-related challenges further complicate the data collection process. In this paper, we explore data augmentation using tempo and speed modifications of healthy speech to simulate dysarthric speech. A DNN-HMM based ASR system and a Random Forest classifier were used to evaluate the proposed method. Synthetically generated dysarthric speech is classified for severity level using a Random Forest classifier trained on actual dysarthric speech, and an ASR system trained on healthy speech augmented with simulated dysarthric speech is evaluated for dysarthric speech recognition. All evaluations were carried out on the Universal Access dysarthric speech corpus. Tempo-based and speed-based data augmentation achieved absolute improvements of 4.24% and 2%, respectively, over ASR performance using healthy speech alone for training.
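A minimal sketch of the two perturbations described above, assuming librosa/soundfile as the signal-processing stack and illustrative stretch factors; the abstract does not specify the exact factors or tooling used in the paper. Tempo modification stretches the utterance while preserving pitch, whereas speed modification resamples so that duration and pitch change together.

```python
# Sketch: simulate dysarthric speech from healthy speech via tempo and
# speed perturbation. Stretch factors and file names are assumptions.
import librosa
import soundfile as sf

def tempo_perturb(wav_path, out_path, rate):
    """Tempo modification: slow the utterance without changing pitch."""
    y, sr = librosa.load(wav_path, sr=None)
    y_slow = librosa.effects.time_stretch(y, rate=rate)  # rate < 1.0 slows speech
    sf.write(out_path, y_slow, sr)

def speed_perturb(wav_path, out_path, factor):
    """Speed modification: resample so duration and pitch change together."""
    y, sr = librosa.load(wav_path, sr=None)
    # More samples written back at the original rate -> slower, lower-pitched speech.
    y_speed = librosa.resample(y, orig_sr=sr, target_sr=int(sr / factor))
    sf.write(out_path, y_speed, sr)

for factor in (0.7, 0.8, 0.9):  # assumed perturbation factors
    tempo_perturb("healthy.wav", f"tempo_{factor}.wav", factor)
    speed_perturb("healthy.wav", f"speed_{factor}.wav", factor)
```

The perturbed waveforms can then be pooled with the original healthy data to train the DNN-HMM acoustic model.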
The use of mixed language in everyday speech is becoming common and is accepted as syntactically correct. However, machine recognition of mixed-language speech is a challenge for a conventional speech recognition engine. Several approaches to recognizing mixed-language speech have been studied. At one end of the spectrum, acoustic models covering the complete phone set of the mixed language are used to enable recognition directly; at the other end, a language identification module is followed by language-dependent speech recognition engines. Each of these has its own implications. In this paper, we approach the problem of mixed-language speech recognition using available resources and show that, by suitably constructing an appropriate pronunciation dictionary and modifying the language model to cover the mixed language, one can achieve good recognition accuracy on spoken mixed language.
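A minimal sketch of the pronunciation-dictionary construction step, assuming two plain-text lexicons ("WORD P1 P2 ...") and a hypothetical phone map from the embedded language onto the primary acoustic model's phone set; the abstract does not publish the language pair or the actual mapping.

```python
# Sketch: merge two single-language lexicons into one mixed-language
# pronunciation dictionary. File names and PHONE_MAP are assumptions.

# Hypothetical mapping from second-language phones to the nearest phones
# in the primary acoustic model's phone set.
PHONE_MAP = {"aa": "AA", "ii": "IY", "k": "K", "tt": "T", "n": "N"}

def map_pron(phones):
    """Map each second-language phone onto the primary phone set."""
    return [PHONE_MAP.get(p, p.upper()) for p in phones]

merged = {}
with open("primary_lexicon.txt") as f:        # primary-language entries kept as-is
    for line in f:
        word, *phones = line.split()
        merged[word] = phones
with open("embedded_lexicon.txt") as f:       # embedded-language entries remapped
    for line in f:
        word, *phones = line.split()
        merged.setdefault(word, map_pron(phones))

with open("mixed_lexicon.txt", "w") as f:     # dictionary for the mixed-language ASR
    for word, phones in sorted(merged.items()):
        f.write(f"{word} {' '.join(phones)}\n")
```

The language model is then retrained (or interpolated) on text containing code-switched sentences so that mixed-language word sequences receive non-zero probability.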
End-to-end Spoken Language Understanding (SLU) systems, which avoid speech-to-text conversion, are more promising in low-resource scenarios. They can be more effective when there is not enough labeled data to train reliable speech recognition and language understanding systems, or when running SLU on the edge is preferred over cloud-based services. In this paper, we present an approach for bootstrapping end-to-end SLU in low-resource scenarios. We show that incorporating layers extracted from pre-trained acoustic models, instead of using the typical Mel filter bank features, leads to better-performing SLU models. Moreover, the layers extracted from a model pre-trained on one language perform well even for (a) SLU tasks in a different language and (b) utterances from speakers with a speech disorder.
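A minimal sketch of the feature-extraction idea, assuming wav2vec 2.0 (via torchaudio) as the pre-trained acoustic model and an arbitrary middle layer; the abstract does not name the specific model architecture or layer choice.

```python
# Sketch: use an intermediate layer of a pre-trained acoustic model as the
# input representation for an SLU classifier, in place of Mel filter banks.
# The model choice and layer index are assumptions.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE        # pre-trained on English audio
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("utterance.wav")
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    # extract_features returns one tensor per transformer layer; an early or
    # middle layer can feed the intent classifier instead of filter banks.
    features, _ = model.extract_features(waveform)

layer_repr = features[6]                           # assumed layer choice
print(layer_repr.shape)                            # (batch, frames, 768)
```

Because the representation comes from a model trained on raw audio of one language, the same extraction pipeline can be reused unchanged for a different target language or for dysarthric speakers, which is the cross-condition transfer the abstract reports.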