Background: To evaluate the usefulness of automatic speech analyses for the assessment of mild cognitive impairment (MCI) and early-stage Alzheimer's disease (AD). Methods: Healthy elderly control (HC) subjects and patients with MCI or AD were recorded while performing several short cognitive vocal tasks. The voice recordings were processed in two steps: first, vocal markers were extracted using speech signal processing techniques; second, the vocal markers were tested to assess their "power" to distinguish among HC, MCI, and AD. This second step included training automatic classifiers for detecting MCI and AD using machine learning methods and testing the detection accuracy. Results: The classification accuracies of the automatic audio analyses were as follows: between HCs and those with MCI, 79% ± 5%; between HCs and those with AD, 87% ± 3%; and between those with MCI and those with AD, 80% ± 5%, demonstrating their utility for assessment. Conclusion: Automatic speech analyses could be an additional objective assessment tool for elderly people with cognitive decline.
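As an illustration of the classification step, the minimal sketch below trains a binary classifier (HC vs. MCI) on a matrix of pre-extracted vocal markers and reports cross-validated accuracy as mean ± standard deviation, matching the way the results above are stated. The feature set, the RBF-kernel SVM, and the random stand-in data are assumptions for illustration, not the paper's actual pipeline.

```python
# Hypothetical sketch: classifying HC vs. MCI from pre-extracted vocal markers.
# Feature names and data are placeholders, not the paper's markers or dataset.
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# X: one row per recording, columns = vocal markers (e.g., pause rate, speech rate,
# voiced-segment duration statistics); y: 0 = HC, 1 = MCI. Random data stands in here.
X = rng.normal(size=(120, 20))
y = rng.integers(0, 2, size=120)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(
    clf, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)
print(f"accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
```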
We present a new implementation of emotion recognition from the paralinguistic information in speech, based on a deep neural network applied directly to spectrograms. This new method achieves higher recognition accuracy than previously published results while also limiting latency. It processes the speech input in short segments of up to 3 seconds and splits a longer input into non-overlapping parts to reduce the prediction latency. The deep network comprises common neural network tools (convolutional and recurrent networks), which are shown to effectively learn the information that represents emotions directly from spectrograms. A convolution-only, lower-complexity deep network achieves a prediction accuracy of 66% over four emotions (tested on IEMOCAP, a common evaluation corpus), while a combined convolution-LSTM, higher-complexity model achieves 68%. Using spectrograms as the speech-representing features enables effective handling of background non-speech signals such as music (excluding singing) and crowd noise, even at noise levels comparable with the speech signal levels. Using harmonic modeling to remove non-speech components from the spectrogram, we demonstrate a significant improvement in emotion recognition accuracy in the presence of unknown background non-speech signals.
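A minimal sketch of the convolution-only variant is shown below: a small 2-D CNN applied directly to fixed-length spectrogram segments, ending in a four-way emotion classifier. The layer sizes, the 128 × 300 spectrogram shape for a roughly 3-second segment, and the pooling scheme are illustrative assumptions, not the published architecture.

```python
# Illustrative convolution-only emotion classifier over fixed-length spectrogram
# segments (shapes and layer sizes are assumptions, not the paper's network).
import torch
import torch.nn as nn

class SpectrogramEmotionCNN(nn.Module):
    def __init__(self, n_emotions: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),          # pool over frequency and time
        )
        self.classifier = nn.Linear(32, n_emotions)

    def forward(self, x):                     # x: (batch, 1, freq_bins, frames)
        h = self.features(x).flatten(1)
        return self.classifier(h)             # logits over the four emotions

# A "3-second" segment: e.g., 128 frequency bins x 300 frames at a 10 ms hop.
model = SpectrogramEmotionCNN()
logits = model(torch.randn(8, 1, 128, 300))
print(logits.shape)                           # torch.Size([8, 4])
```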
We are interested in retrieving information from conversational speech corpora, such as call-center data. These data comprise spontaneous speech conversations with low recording quality, which makes automatic speech recognition (ASR) a highly difficult task. For typical call-center data, even state-of-the-art large-vocabulary continuous speech recognition systems produce transcripts with a word error rate of 30% or higher. In addition to the output transcript, advanced systems provide word confusion networks (WCNs), a compact representation of word lattices that associates each word hypothesis with its posterior probability. Our work exploits the information provided by WCNs to improve retrieval performance. In this paper, we show that mean average precision (MAP) is improved using WCNs compared to the raw word transcripts. Finally, we analyze the effect of increasing ASR word error rate on search effectiveness and show that MAP remains reasonable even under extremely high error rates.
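The following toy sketch illustrates how WCN information can be folded into retrieval: every word hypothesis in a confusion slot contributes its posterior probability to the document's term weight, so a query term can still match even when it is absent from the 1-best transcript. The tiny hand-written WCN, the document id, and the additive scoring function are hypothetical and only meant to convey the idea, not the paper's indexing scheme.

```python
# Toy sketch of indexing word confusion network (WCN) hypotheses: each hypothesis
# adds its posterior probability to the document's term weight instead of the
# 0/1 count a 1-best transcript would give (illustrative only).
from collections import defaultdict

# A WCN as a list of slots; each slot lists (word, posterior) alternatives.
wcn_doc1 = [[("please", 0.9), ("peas", 0.1)],
            [("cancel", 0.6), ("counsel", 0.4)],
            [("my", 0.95), ("by", 0.05)],
            [("order", 0.8), ("older", 0.2)]]

index = defaultdict(lambda: defaultdict(float))   # term -> doc_id -> expected count
for slot in wcn_doc1:
    for word, posterior in slot:
        index[word]["doc1"] += posterior

def search(query_terms):
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index[term].items():
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search(["cancel", "order"]))   # [('doc1', 1.4...)], best match first
```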
No abstract
Training an end-to-end (E2E) neural network speech-to-intent (S2I) system that directly extracts intents from speech requires large amounts of intent-labeled speech data, which is time consuming and expensive to collect. Initializing the S2I model with an ASR model trained on copious speech data can alleviate data sparsity. In this paper, we attempt to leverage NLU text resources. We implemented a CTC-based S2I system that matches the performance of a state-of-the-art, traditional cascaded SLU system. We performed controlled experiments with varying amounts of speech and text training data. When only a tenth of the original data is available, intent classification accuracy degrades by 7.6% absolute. Assuming we have additional text-to-intent data (without speech) available, we investigated two techniques to improve the S2I system: (1) transfer learning, in which acoustic embeddings for intent classification are tied to fine-tuned BERT text embeddings; and (2) data augmentation, in which the text-to-intent data is converted into speech-to-intent data using a multi-speaker text-to-speech system. The proposed approaches recover 80% of performance lost due to using limited intent-labeled speech.
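A hedged sketch of the first technique (tying acoustic embeddings to text embeddings) is given below: a recurrent acoustic encoder produces an utterance embedding that feeds an intent classifier, while an auxiliary loss pulls that embedding toward a pre-computed text embedding of the same utterance. The encoder, pooling, dimensions, and the simple MSE tying loss are assumptions for illustration; the random tensors stand in for real acoustic features, fine-tuned BERT embeddings, and intent labels.

```python
# Illustrative embedding-tying setup for speech-to-intent (not the paper's exact model).
import torch
import torch.nn as nn

class AcousticIntentModel(nn.Module):
    def __init__(self, feat_dim=80, embed_dim=768, n_intents=10):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, embed_dim, batch_first=True)
        self.intent_head = nn.Linear(embed_dim, n_intents)

    def forward(self, feats):                 # feats: (batch, frames, feat_dim)
        outputs, _ = self.encoder(feats)
        utt_embed = outputs.mean(dim=1)       # simple mean pooling over time
        return utt_embed, self.intent_head(utt_embed)

model = AcousticIntentModel()
feats = torch.randn(4, 200, 80)              # stand-in acoustic features
text_embed = torch.randn(4, 768)             # stand-in for fine-tuned BERT embeddings
intents = torch.randint(0, 10, (4,))         # stand-in intent labels

utt_embed, logits = model(feats)
loss = nn.functional.cross_entropy(logits, intents) \
     + nn.functional.mse_loss(utt_embed, text_embed)   # tie acoustic to text space
loss.backward()
print(float(loss))
```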