Picture description tasks are used for the detection of cognitive decline associated with Alzheimer's disease (AD). Recent years have seen work on automatic AD detection in picture descriptions based on acoustic and word-based analysis of the speech. These methods have shown some success but cannot capture higher-level effects of cognitive decline on the patient's language. In this paper, we propose a novel model that encodes both the hierarchical and sequential structure of the description and detects its informative units with an attention mechanism. Automatic speech recognition (ASR) and punctuation restoration are used to transcribe and segment the data. Using the DementiaBank database of people with AD as well as healthy controls (HC), we obtain F-scores of 84.43% and 74.37% when using manual and automatic transcripts respectively. We further explore the effect of adding additional data (a total of 33 descriptions collected using a 'digital doctor') during model training, and increase the F-score when using ASR transcripts to 76.09%. This outperforms baseline models, including a bidirectional LSTM and a bidirectional hierarchical neural network without an attention mechanism, and demonstrates that the use of hierarchical models with an attention mechanism improves AD/HC discrimination performance.
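The abstract above does not give implementation details, but the described architecture corresponds to a hierarchical attention network: a word-level recurrent encoder with attention produces one vector per utterance, and an utterance-level encoder with attention pools these into a single description vector for classification. A minimal PyTorch sketch under that assumption (all layer sizes, the choice of GRUs, and the padded `docs` tensor of word indices are illustrative, not the authors' exact configuration):

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Additive attention that pools a sequence of hidden states into one vector."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.context = nn.Linear(dim, 1, bias=False)

    def forward(self, h):                                    # h: (batch, seq, dim)
        scores = self.context(torch.tanh(self.proj(h)))      # (batch, seq, 1)
        weights = torch.softmax(scores, dim=1)
        return (weights * h).sum(dim=1)                      # (batch, dim)

class HierarchicalAttentionClassifier(nn.Module):
    """Word-level BiGRU + attention per utterance, then utterance-level BiGRU + attention."""
    def __init__(self, vocab_size, emb_dim=100, hid=64, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_rnn = nn.GRU(emb_dim, hid, bidirectional=True, batch_first=True)
        self.word_attn = Attention(2 * hid)
        self.sent_rnn = nn.GRU(2 * hid, hid, bidirectional=True, batch_first=True)
        self.sent_attn = Attention(2 * hid)
        self.out = nn.Linear(2 * hid, n_classes)

    def forward(self, docs):                                 # docs: (batch, n_utts, n_words) word ids
        b, n_utts, n_words = docs.shape
        words = self.embed(docs.view(b * n_utts, n_words))
        word_h, _ = self.word_rnn(words)
        utt_vecs = self.word_attn(word_h).view(b, n_utts, -1)  # one vector per utterance
        sent_h, _ = self.sent_rnn(utt_vecs)
        doc_vec = self.sent_attn(sent_h)                     # one vector per description
        return self.out(doc_vec)                             # AD vs HC logits

# Illustrative usage: two descriptions, each padded to 10 utterances of 25 words.
# logits = HierarchicalAttentionClassifier(vocab_size=5000)(torch.randint(0, 5000, (2, 10, 25)))
```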
In light of the current COVID-19 pandemic, the need for remote digital health assessment tools is greater than ever. This is especially true for elderly and vulnerable populations. In this regard, the INTERSPEECH 2020 Alzheimer's Dementia Recognition through Spontaneous Speech (ADReSS) Challenge offers competitors the opportunity to develop speech- and language-based systems for the task of Alzheimer's Dementia (AD) recognition. The challenge data consist of speech recordings and their transcripts; the work presented herein assesses different contemporary approaches on these two modalities. Specifically, we compared a hierarchical neural network with an attention mechanism trained on linguistic features with three acoustic-based systems: (i) Bag-of-Audio-Words (BoAW) quantising different low-level descriptors, (ii) a Siamese Network trained on log-Mel spectrograms, and (iii) a Convolutional Neural Network (CNN) end-to-end system trained on raw waveforms. Key results indicate the strength of the linguistic approach over the acoustic systems. Our strongest test-set result was achieved using a late fusion combination of BoAW, the end-to-end CNN, and the hierarchical attention network, which outperformed the challenge baseline in both the classification and regression tasks.
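The fusion rule itself is not specified in this summary; a common form of late fusion is a weighted average of each subsystem's class posteriors. The sketch below uses hypothetical probability arrays for the three systems, and the weights are placeholders that would normally be tuned on a development set:

```python
import numpy as np

def late_fuse(prob_boaw, prob_cnn, prob_han, weights=(1.0, 1.0, 1.0)):
    """Weighted average of per-class probabilities from the three subsystems.

    Each prob_* array has shape (n_subjects, n_classes). Returns the fused
    class decision per subject (e.g. AD vs non-AD for the classification task).
    """
    stacked = np.stack([prob_boaw, prob_cnn, prob_han])          # (3, n, c)
    w = np.asarray(weights, dtype=float)[:, None, None]
    fused = (w * stacked).sum(axis=0) / w.sum()                  # (n, c)
    return fused.argmax(axis=1)
```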
Exploring the acoustic and linguistic information embedded in spontaneous speech recordings has proven effective for automatic Alzheimer's dementia detection. Acoustic features can be extracted directly from the audio recordings; in fully automatic systems, however, linguistic features must be extracted from transcripts generated by an automatic speech recognition (ASR) system. We explore two state-of-the-art ASR paradigms, Wav2vec2.0 (for transcription and feature extraction) and time delay neural networks (TDNN), on the ADReSSo dataset containing recordings of people describing the Cookie Theft (CT) picture. As no manual transcripts are provided, we train an ASR system using our in-house CT data. We further investigate the use of confidence scores and multiple ASR hypotheses to guide and augment the input for BERT-based classification. In total, five models are proposed to explore how acoustic and linguistic information can be extracted from the audio recordings alone. The test results of the best acoustic-only and best linguistic-only models are 74.65% and 84.51% respectively (representing 15% and 9% relative increases over published baseline results).
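The exact way confidence scores and n-best hypotheses are combined is not described in this summary. One simple, purely illustrative scheme is to keep hypotheses above a confidence threshold and concatenate them before passing the text to a BERT sequence classifier; the threshold value, the `bert-base-uncased` checkpoint and the `classify_from_nbest` helper below are assumptions, not the authors' setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def classify_from_nbest(nbest):
    """nbest: list of (hypothesis_text, confidence) pairs from the ASR system.

    Keep hypotheses above an illustrative confidence threshold, join them into
    a single augmented input, and return the class posteriors from BERT.
    """
    kept = [text for text, conf in nbest if conf >= 0.5]      # 0.5 is an illustrative threshold
    joined = " ".join(kept) if kept else nbest[0][0]          # fall back to the 1-best hypothesis
    inputs = tokenizer(joined, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.softmax(dim=-1)                             # P(control), P(AD)
```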
Speech- and language-based automatic dementia detection is of interest because it is non-invasive, low-cost and potentially able to improve diagnostic accuracy. The collected data are mostly audio recordings of spoken language, and these can be used directly for acoustic-based analysis. To extract linguistic-based information, an automatic speech recognition (ASR) system is used to generate transcriptions. However, the extraction of reliable acoustic features is difficult when the acoustic quality of the data is poor, as is the case with DementiaBank, the largest open-source dataset for Alzheimer's Disease classification. In this paper, we explore how to improve the robustness of acoustic feature extraction by using time alignment information and confidence scores from the ASR system to identify audio segments of good quality. In addition, we design rhythm-inspired features and combine them with acoustic features. By classifying the combined features with a bidirectional-LSTM attention network, the F-measure improves from 62.15% to 70.75% when only the high-quality segments are used. Finally, we apply the same approach to our previously proposed hierarchical network using linguistic-based features and show an improvement from 74.37% to 77.25%. By combining the acoustic and linguistic systems, a state-of-the-art 78.34% F-measure is achieved on the DementiaBank task.
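As a rough illustration of the segment-selection idea, the sketch below keeps only ASR-aligned segments whose confidence and duration exceed fixed thresholds; the field names and threshold values are hypothetical, not those used in the paper:

```python
def select_high_quality_segments(segments, min_confidence=0.8, min_duration=0.3):
    """Keep only the time-aligned segments the ASR system is confident about.

    `segments` is assumed to be a list of dicts with 'start', 'end' (seconds)
    and 'confidence' keys taken from the recogniser's alignment output.
    Acoustic features are then extracted only from the returned spans.
    """
    keep = []
    for seg in segments:
        duration = seg["end"] - seg["start"]
        if seg["confidence"] >= min_confidence and duration >= min_duration:
            keep.append((seg["start"], seg["end"]))
    return keep
```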
Speech-based automatic approaches for detecting neurodegenerative disorders (ND) and mild cognitive impairment (MCI) have received more attention recently because they are non-invasive and potentially more sensitive than current pen-and-paper tests. The performance of such systems is highly dependent on the choice of features in the classification pipeline. For acoustic features in particular, arriving at a consensus on a best feature set has proven challenging. This paper explores using a deep neural network to extract features directly from the speech signal as a solution to this. Compared with hand-crafted features, more information is present in the raw waveform, but the feature extraction process becomes more complex and less interpretable, which is often undesirable in medical domains. Using a SincNet as the first layer allows for some analysis of the learned features. We propose and evaluate the Sinc-CLA (with SincNet, Convolutional, Long Short-Term Memory and Attention layers) as a task-driven acoustic feature extractor for classifying MCI, ND and healthy controls (HC). Experiments are carried out on an in-house dataset. Compared with the popular hand-crafted feature sets, the learned task-driven features achieve superior classification accuracy. The filters of the SincNet are inspected, and acoustic differences between HC, MCI and ND are found.
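A compact PyTorch sketch of a Sinc-CLA-style stack follows: a simplified sinc-based band-pass filter bank as the first layer, then convolutional, bidirectional LSTM and attention-pooling layers producing three-way HC/MCI/ND logits. All dimensions, the filter initialisation and the layer counts are illustrative assumptions rather than the authors' exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConv(nn.Module):
    """Simplified sinc-based band-pass filter bank (in the spirit of SincNet);
    the learnable parameters are each filter's low cut-off and bandwidth in Hz."""
    def __init__(self, out_channels=40, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        low = torch.linspace(30.0, sample_rate / 2 - 300.0, out_channels).unsqueeze(1)
        self.low_hz = nn.Parameter(low)
        self.band_hz = nn.Parameter(torch.full((out_channels, 1), 100.0))
        n = (kernel_size - 1) // 2
        self.register_buffer("t", torch.arange(-n, n + 1).float().view(1, -1) / sample_rate)
        self.register_buffer("window", torch.hamming_window(kernel_size).view(1, -1))

    def forward(self, x):                                    # x: (batch, 1, samples)
        low = self.low_hz.abs()
        high = low + self.band_hz.abs()
        lp = lambda f: 2 * f * torch.sinc(2 * f * self.t)    # ideal low-pass impulse response
        filters = ((lp(high) - lp(low)) * self.window).view(-1, 1, self.kernel_size)
        return F.conv1d(x, filters, padding=self.kernel_size // 2)

class SincCLA(nn.Module):
    """Sinc filter bank -> Conv1d -> BiLSTM -> attention pooling -> HC/MCI/ND logits."""
    def __init__(self, n_classes=3, hid=64):
        super().__init__()
        self.sinc = SincConv(out_channels=40)
        self.conv = nn.Sequential(
            nn.Conv1d(40, 60, kernel_size=5, stride=4), nn.ReLU(),
            nn.Conv1d(60, 60, kernel_size=5, stride=4), nn.ReLU(),
        )
        self.lstm = nn.LSTM(60, hid, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hid, 1)
        self.out = nn.Linear(2 * hid, n_classes)

    def forward(self, wave):                                 # wave: (batch, 1, samples)
        h = self.conv(self.sinc(wave)).transpose(1, 2)       # (batch, frames, 60)
        h, _ = self.lstm(h)
        w = torch.softmax(self.attn(h), dim=1)               # attention weights over frames
        return self.out((w * h).sum(dim=1))

# Because the first layer is constrained to band-pass shapes, its learned cut-off
# frequencies can be inspected directly, which is what enables the filter analysis
# mentioned in the abstract.
```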