Speech depression recognition based on attentional residual network

Lu, Xin; Shi, Daimin; Liu, Yang; Yuan, Jingyi

doi:10.52586/5066

Cited by 15 publications

(8 citation statements)

References 22 publications


“…Ditthapron et al [ 27 ] used smartphones to passively capture changes in acoustic characteristics of spontaneous speech for continuous traumatic brain injury monitoring. Spontaneous speech can also be used for research on depression [ 53 ] and aphasia [ 28 , 54 ]. Thus, our proposed spontaneous speech-based approach has the potential to be used in other clinical populations with acquired language disorders.…”

Section: Discussion (mentioning)

confidence: 99%

Efficient Pause Extraction and Encode Strategy for Alzheimer’s Disease Detection Using Only Acoustic Features from Spontaneous Speech

Liu

Fan

et al. 2023

Brain Sciences


Clinical studies have shown that speech pauses can reflect the cognitive function differences between Alzheimer’s Disease (AD) and non-AD patients, while the value of pause information in AD detection has not been fully explored. Herein, we propose a speech pause feature extraction and encoding strategy for only acoustic-signal-based AD detection. First, a voice activity detection (VAD) method was constructed to detect pause/non-pause feature and encode it to binary pause sequences that are easier to calculate. Then, an ensemble machine-learning-based approach was proposed for the classification of AD from the participants’ spontaneous speech, based on the VAD Pause feature sequence and common acoustic feature sets (ComParE and eGeMAPS). The proposed pause feature sequence was verified in five machine-learning models. The validation data included two public challenge datasets (ADReSS and ADReSSo, English voice) and a local dataset (10 audio recordings containing five patients and five controls, Chinese voice). Results showed that the VAD Pause feature was more effective than common feature sets (ComParE: 6373 features and eGeMAPS: 88 features) for AD classification, and that the ensemble method improved the accuracy by more than 5% compared to several baseline methods (8% on the ADReSS dataset; 5.9% on the ADReSSo dataset). Moreover, the pause-sequence-based AD detection method could achieve 80% accuracy on the local dataset. Our study further demonstrated the potential of pause information in speech-based AD detection, and also contributed to a more accessible and general pause feature extraction and encoding method for AD detection.
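The pause-encoding step described in this abstract can be illustrated with a minimal sketch. The abstract does not specify how its VAD works, so the fixed RMS-energy threshold below is an assumption swapped in for illustration, and the function names, frame length, and threshold value are all hypothetical.

```python
import math

def frame_energies(samples, frame_len):
    """Split samples into non-overlapping frames and return the RMS energy of each."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies

def pause_sequence(samples, frame_len=4, threshold=0.1):
    """Encode each frame as 1 (pause) or 0 (speech).

    Assumption: a frame counts as a pause when its RMS energy is below a
    fixed threshold. The cited paper's actual VAD may differ.
    """
    return [1 if e < threshold else 0 for e in frame_energies(samples, frame_len)]

def pause_stats(seq):
    """Return (fraction of pause frames, longest run of consecutive pause frames)."""
    longest = run = 0
    for bit in seq:
        run = run + 1 if bit == 1 else 0
        longest = max(longest, run)
    ratio = sum(seq) / len(seq) if seq else 0.0
    return ratio, longest

# Toy signal: one loud frame, two near-silent frames, one loud frame.
signal = [0.5, -0.4, 0.6, -0.5] + [0.01, -0.02, 0.0, 0.01] * 2 + [0.3, -0.3, 0.4, -0.2]
seq = pause_sequence(signal)
print(seq)               # [0, 1, 1, 0]
print(pause_stats(seq))  # (0.5, 2)
```

A binary sequence like this is cheap to compute and can be passed to a classifier directly or summarized into statistics such as the pause ratio shown here.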


“…To collect sensor data with sensors by using our tool presented in [ 14 ] would have required us to be in presence, because it is only a prototype, not easy to use. For this reason, we preferred to use pictures and speech in this preliminary version, reassured that speech is largely adopted in recent study on depression, see for example ([ 24 , 30 , 37 , 44 ]).…”

Section: Discussion and Limitations (mentioning)

confidence: 99%

“…BDI-II score has been adopted to label the training dataset. Also He et al [ 24 ] used BDI-II scores for verifying their depression prediction when applying attentional residual network on Videos of the AVEC2013 and AVEC2014 datasets. The model also estimated the severity of depression.…”

Section: Introduction (mentioning)

confidence: 99%

Emotion detection for supporting depression screening

Francese

Attanasio

2022

Multimed Tools Appl


Depression is the most prevalent mental disorder in the world. One of the most adopted tools for depression screening is the Beck Depression Inventory-II (BDI-II) questionnaire. Patients may minimize or exaggerate their answers. Thus, to further examine the patient’s mood while filling in the questionnaire, we propose a mobile application that captures the patient’s BDI-II responses together with their images and speech. Deep learning techniques such as Convolutional Neural Networks analyze the patient’s audio and image data. The application displays the correlation between the patient’s emotional scores and BDI-II scores to the clinician at the end of the questionnaire, indicating the relationship between the patient’s emotional state and the depression screening score. We conducted a preliminary evaluation involving clinicians and patients to assess (i) the acceptability of the proposed application for use in clinics and (ii) the patient user experience. The participants were eight clinicians who tried the tool with 21 of their patients. The results seem to confirm the acceptability of the app in clinical practice.


“…Manual features such as spectral, source, prosodic, and formant features are commonly employed when analyzing depression and suicidality ( Cummins et al, 2015 ). Moreover, these features are also regarded as inputs to deep neural networks ( Lang and Cui, 2018 ; Lu X. et al, 2021 ). Studies have shown that the advanced features generated by MFCC feeding into the Short Long-Term Memory (LSTM) can preserve information related to depression ( Rejaibi et al, 2022 ).…”

Section: Related Work (mentioning)

confidence: 99%

Ensemble learning with speaker embeddings in multiple speech task stimuli for depression detection

Liu

Li

et al. 2023

Front. Neurosci.


Introduction: As a biomarker of depression, the speech signal has attracted the interest of many researchers because it is easy to collect and non-invasive. However, subjects’ speech variation under different scenes and emotional stimuli, the insufficient amount of depression speech data for deep learning, and the variable length of frame-level speech features all affect recognition performance.

Methods: To address the above problems, this study proposes a multi-task ensemble learning method based on speaker embeddings for depression classification. First, we extract the Mel Frequency Cepstral Coefficients (MFCC), the Perceptual Linear Predictive Coefficients (PLP), and the Filter Bank (FBANK) features from the out-domain dataset (CN-Celeb) and train the Resnet x-vector extractor, the Time delay neural network (TDNN) x-vector extractor, and the i-vector extractor. Then, we extract the corresponding fixed-length speaker embeddings from the depression speech database of the Gansu Provincial Key Laboratory of Wearable Computing. Support Vector Machine (SVM) and Random Forest (RF) are used to obtain the classification results of speaker embeddings in nine speech tasks. To make full use of the information from speech tasks with different scenes and emotions, we aggregate the classification results of the nine tasks into new features and then obtain the final classification results by using a Multilayer Perceptron (MLP). In order to take advantage of the complementary effects of different features, Resnet x-vectors based on different acoustic features are fused in the ensemble learning method.

Results: Experimental results demonstrate that (1) MFCC-based Resnet x-vectors perform best among the nine speaker embeddings for depression detection; (2) interview speech is better than picture description speech, and neutral stimulus is the best among the three emotional valences in the depression recognition task; (3) our multi-task ensemble learning method with MFCC-based Resnet x-vectors can effectively identify depressed patients; (4) in all cases, the combination of MFCC-based Resnet x-vectors and PLP-based Resnet x-vectors in our ensemble learning method achieves the best results, outperforming other literature studies using the depression speech database.

Discussion: Our multi-task ensemble learning method with MFCC-based Resnet x-vectors can effectively fuse the depression-related information of different stimuli, which provides a new approach for depression detection. The limitation of this method is that the speaker embedding extractors were pre-trained on the out-domain dataset. We will consider using the augmented in-domain dataset for pre-training to further improve depression recognition performance.
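The aggregation step in the Methods (per-task classification results combined into new features for a final classifier) can be sketched roughly as follows. This is an illustration only: the abstract uses an MLP as the final classifier, whereas the sketch substitutes a simple mean-score threshold as a stand-in, and the task names, scores, and function names are hypothetical.

```python
def aggregate_task_outputs(task_scores, task_order):
    """Stack one subject's per-task classifier scores into a fixed-order feature vector."""
    return [task_scores[name] for name in task_order]

def mean_score_decision(features, threshold=0.5):
    """Stand-in for the final classifier (an MLP in the abstract).

    Returns 1 if the mean per-task score reaches the threshold, else 0.
    """
    return 1 if sum(features) / len(features) >= threshold else 0

# Hypothetical task names; the study itself uses nine speech tasks.
TASKS = ["interview", "reading", "picture"]

# Hypothetical per-task scores for one subject, e.g. from an SVM or RF.
subject = {"interview": 0.9, "reading": 0.4, "picture": 0.5}

features = aggregate_task_outputs(subject, TASKS)
print(features)                       # [0.9, 0.4, 0.5]
print(mean_score_decision(features))  # 1
```

Keeping the task order fixed means every subject produces a feature vector of the same length and layout, which is what allows a second-stage classifier to be trained on the aggregated outputs.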

