Interspeech 2020
DOI: 10.21437/interspeech.2020-2684
Acoustic Feature Extraction with Interpretable Deep Neural Network for Neurodegenerative Related Disorder Classification

Abstract: Speech-based automatic approaches for detecting neurodegenerative disorders (ND) and mild cognitive impairment (MCI) have received more attention recently due to being non-invasive and potentially more sensitive than current pen-and-paper tests. The performance of such systems is highly dependent on the choice of features in the classification pipeline. In particular for acoustic features, arriving at a consensus for a best feature set has proven challenging. This paper explores using deep neural network for e…

Cited by 12 publications (11 citation statements)
References 21 publications
“…For example, it has been shown that the first layers of end-to-end convolutional neural networks that learn representations from raw audio data extract features that are similar to the spectrogram or energies in mel-frequency bands [53][54][55]. Additionally, some works have addressed the design of the first layers of these networks to tailor the feature extraction stage using parametric filters [56][57][58] or trainable hand-crafted kernels [59,60]. Attention mechanisms have been used to bring interpretability to neural networks in speech and music emotion recognition [61,62] and in music auto-tagging [63].…”
Section: Relation With Previous Work
confidence: 99%
“…Previous studies [16] have shown that Sinc-CLA architecture has a good performance and interpretability in classifying recordings from people living with mild cognitive impairment, neurodegenerative disorders, or healthy controls. The multi-task Sinc-CLA system introduced in this paper is shown in Figure 3.…”
Section: End-to-end System
confidence: 99%
“…The SincNet Layer and CNN layers are shared by the two tasks, but the bi-directional LSTM and its following layers are separately trained with a specific target (age or MMSE). The detailed description of each functional layer can be found in Section 3.4 of this paper and in [16].…”
Section: End-to-end System
confidence: 99%
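The quoted passage describes hard parameter sharing: one trunk (SincNet + CNN) feeds two task-specific heads (age and MMSE). A minimal NumPy forward-pass sketch of that layout, with illustrative layer sizes and names that are assumptions, not the authors' architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_dense(x, w, b):
    """One fully connected layer with ReLU activation."""
    return np.maximum(x @ w + b, 0.0)

# Shared trunk, standing in for the SincNet + CNN front end.
w_shared, b_shared = 0.1 * rng.standard_normal((40, 64)), np.zeros(64)

# Task-specific heads, standing in for the per-task BiLSTM + output layers.
w_age, b_age = 0.1 * rng.standard_normal((64, 1)), np.zeros(1)
w_mmse, b_mmse = 0.1 * rng.standard_normal((64, 1)), np.zeros(1)

def forward(features):
    shared = relu_dense(features, w_shared, b_shared)  # computed once for both tasks
    age = shared @ w_age + b_age                       # age regression head
    mmse = shared @ w_mmse + b_mmse                    # MMSE regression head
    return age, mmse

age_pred, mmse_pred = forward(rng.standard_normal((8, 40)))  # batch of 8 feature vectors
```

Sharing the trunk means both targets back-propagate into the same feature extractor, which is the usual motivation for multi-task training on related clinical labels.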