Silent speech recognition (SSR) converts non-audio information such as articulatory movements into text. SSR has the
potential to enable persons with laryngectomy to communicate through natural spoken expression. Current SSR systems have largely
relied on speaker-dependent recognition models. The high degree of variability in articulatory patterns across speakers has been a barrier to developing effective speaker-independent SSR approaches, yet such approaches are critical for reducing the amount of training data required from each speaker. In this paper, we investigate speaker-independent SSR from the movements of flesh points on the tongue and lips, using articulatory normalization methods that reduce inter-speaker
variation. To minimize the across-speaker physiological differences of the articulators, we propose Procrustes matching-based
articulatory normalization by removing locational, rotational, and scaling differences. To further normalize the articulatory
data, we apply feature-space maximum likelihood linear regression and i-vectors. We adopt a bidirectional long short-term memory recurrent neural network (BLSTM) as the articulatory model to capture articulatory movements with
long-range articulatory history. A silent speech data set with flesh points was collected using an electromagnetic articulograph
(EMA) from twelve healthy and two laryngectomized English speakers. Experimental results showed the effectiveness of our
speaker-independent SSR approaches on both healthy speakers and speakers with laryngectomy. In addition, the BLSTM outperformed a standard deep neural network. The best performance was obtained by the BLSTM with all three normalization approaches combined.
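
To make the Procrustes matching-based normalization concrete, the following is a minimal NumPy sketch of the idea: each speaker's flesh-point data are translated, scaled, and rotated into a common reference frame, removing locational, rotational, and scaling differences. The flesh-point layout, reference shape, and array shapes here are illustrative assumptions, not the paper's actual configuration.

import numpy as np

def procrustes_normalize(frames, reference):
    """Remove locational, rotational, and scaling differences between one
    speaker's articulatory frames and a reference shape.

    frames:    (num_frames, num_points, 2) midsagittal flesh-point trajectories
               (e.g., tongue tip, tongue body, upper lip, lower lip)
    reference: (num_points, 2) reference shape, already centered and unit-scaled
    """
    # Estimate a single speaker-specific transform from the speaker's mean shape.
    mean_shape = frames.mean(axis=0)

    # 1. Translation: move the centroid of the mean shape to the origin.
    centroid = mean_shape.mean(axis=0)
    centered = mean_shape - centroid

    # 2. Scaling: normalize by the Frobenius norm of the centered mean shape.
    scale = np.linalg.norm(centered)

    # 3. Rotation: orthogonal Procrustes solution via SVD of the cross-covariance.
    u, _, vt = np.linalg.svd((centered / scale).T @ reference)
    rotation = u @ vt

    # Apply the same translation, scale, and rotation to every frame, so the
    # articulatory dynamics are preserved while speaker-specific position,
    # size, and orientation differences are removed.
    return ((frames - centroid) / scale) @ rotation

In practice the reference could be, for example, a template speaker's mean shape or the grand mean across training speakers; the paper's exact choice is not assumed here.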
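For the articulatory model itself, a minimal sketch of a BLSTM over normalized flesh-point trajectories, assuming PyTorch, is given below; the layer sizes, number of articulatory features, and output dimension are illustrative assumptions rather than the configuration used in the paper.

import torch
import torch.nn as nn

class BLSTMArticulatoryModel(nn.Module):
    """Frame-level recognizer over normalized articulatory features."""
    def __init__(self, num_features=12, hidden_size=128, num_outputs=40):
        super().__init__()
        # Stacked bidirectional LSTMs model long-range articulatory history
        # in both the forward and backward directions.
        self.blstm = nn.LSTM(
            input_size=num_features,   # e.g., x/y positions of 6 flesh points
            hidden_size=hidden_size,
            num_layers=2,
            batch_first=True,
            bidirectional=True,
        )
        # Frame-level output layer, e.g., phone posteriors for the recognizer.
        self.output = nn.Linear(2 * hidden_size, num_outputs)

    def forward(self, x):
        # x: (batch, time, num_features) normalized articulatory trajectories
        hidden, _ = self.blstm(x)
        return self.output(hidden)   # (batch, time, num_outputs)

# Example: 4 utterances, 200 frames each, 12 articulatory features per frame.
posteriors = BLSTMArticulatoryModel()(torch.randn(4, 200, 12))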
Purpose:
This research aimed to automatically predict intelligible speaking rate for individuals with amyotrophic lateral sclerosis (ALS) from acoustic and articulatory speech samples.
Method:
Twelve participants with ALS and two healthy controls produced a total of 1,831 phrases. The NDI Wave system was used to collect tongue movement, lip movement, and acoustic data synchronously. A machine learning algorithm (i.e., a support vector machine) was used to predict intelligible speaking rate (speech intelligibility × speaking rate) from acoustic and articulatory features of the recorded samples.
Result:
Acoustic, lip movement, and tongue movement information used separately yielded R2 values of 0.652, 0.660, and 0.678 and root-mean-squared errors (RMSEs) of 41.096, 41.166, and 39.855 words per minute (WPM) between the predicted and actual values, respectively. Combining the acoustic, lip, and tongue information gave the highest R2 (0.712) and the lowest RMSE (37.562 WPM).
Conclusion:
The results revealed that the proposed analyses predicted participants' intelligible speaking rate with reasonably high accuracy from the acoustic and/or articulatory features of a single short speech sample. With further development, these analyses may be well suited for clinical applications that require automatic prediction of speech severity.
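
As a rough illustration of the prediction setup described above, the following is a minimal scikit-learn sketch of support vector regression from per-phrase features to intelligible speaking rate (intelligibility × speaking rate, in WPM). The feature matrix, kernel, and hyperparameters are placeholders and are not the ones used in the study.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score, mean_squared_error

# X: one row per phrase, concatenating acoustic and articulatory features
#    (e.g., spectral statistics, tongue/lip movement speed and range).
# y: intelligible speaking rate = intelligibility (0-1) * speaking rate (WPM).
rng = np.random.default_rng(0)
X = rng.normal(size=(1831, 60))          # placeholder feature matrix
y = rng.uniform(20.0, 220.0, size=1831)  # placeholder targets in WPM

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=1.0))
predicted = cross_val_predict(model, X, y, cv=5)

print("R2  :", r2_score(y, predicted))
print("RMSE:", np.sqrt(mean_squared_error(y, predicted)), "WPM")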
Dysarthria is a motor speech disorder that impedes the physical production of speech. Speech in patients with dysarthria is generally characterized by poor articulation, breathy voice, and monotonic intonation. Modeling the spectral and temporal characteristics of dysarthric speech is therefore critical for better performance in dysarthric speech recognition. Convolutional long short-term memory recurrent neural networks (CLSTM-RNNs) have recently been used successfully in recognition of typical speech, but have rarely been applied to dysarthric speech recognition. We hypothesized that CLSTM-RNNs have the potential to capture the distinct characteristics of dysarthric speech, taking advantage of convolutional neural networks (CNNs) for extracting effective local features and LSTM-RNNs for modeling the temporal dependencies of those features. In this paper, we investigate the use of CLSTM-RNNs for dysarthric speech recognition. Experimental evaluation on a database collected from nine patients with dysarthria showed that our approach provides a substantial improvement over both standard CNN-based and LSTM-RNN-based speech recognizers.
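
As a rough illustration of the CLSTM-RNN architecture described above, here is a minimal PyTorch sketch in which convolutional layers extract local time-frequency features from log-mel spectrogram frames and LSTM layers model their temporal dependencies. Filter counts, kernel sizes, and the output dimension are illustrative assumptions, not the configuration evaluated in the paper.

import torch
import torch.nn as nn

class CLSTMAcousticModel(nn.Module):
    """CNN front end for local time-frequency features, LSTM back end for
    temporal modeling, frame-level outputs for the speech recognizer."""
    def __init__(self, num_mels=40, hidden_size=256, num_targets=1000):
        super().__init__()
        # 2-D convolutions over (time, frequency) capture local spectral patterns.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),   # pool along frequency only
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),
        )
        cnn_features = 32 * (num_mels // 4)
        # LSTM layers model the temporal dependencies of the CNN features.
        self.lstm = nn.LSTM(cnn_features, hidden_size, num_layers=2, batch_first=True)
        self.output = nn.Linear(hidden_size, num_targets)   # e.g., senone posteriors

    def forward(self, x):
        # x: (batch, time, num_mels) log-mel filterbank features
        feats = self.cnn(x.unsqueeze(1))               # (batch, 32, time, num_mels // 4)
        feats = feats.permute(0, 2, 1, 3).flatten(2)   # (batch, time, 32 * num_mels // 4)
        hidden, _ = self.lstm(feats)
        return self.output(hidden)                     # (batch, time, num_targets)

# Example: 2 utterances, 300 frames each, 40 log-mel features per frame.
posteriors = CLSTMAcousticModel()(torch.randn(2, 300, 40))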