Integrating Deep and Shallow Models for Multi-Modal Depression Analysis—Hybrid Architectures

Yang, Le; Jiang, Dongmei; Sahli, Hichem

doi:10.1109/taffc.2018.2870398

Cited by 78 publications

(38 citation statements)

References 67 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In this work, we make use of our previously proposed speech acoustic feature for depression severity prediction from speech [34], [45]. The feature set is composed of the Geneva Minimalistic Acoustic Parameter Set (GeMAPS) and the INTERSPEECH Challenges feature sets.…”

Section: Feature Generation a Speech Acoustic Featuresmentioning

confidence: 99%

“…In total 238 low level descriptors (LLDs), consisting of 211 spectral and energy related features and 27 voicing related dynamic features are firstly extracted, then 25 statistical functionals and 4 regression functionals are performed resulting in a 6902 dimensional feature vector for each speech segment. The reader can refer to our previous work [34], [45] for more details. In this work, we consider the proposed features as speech descriptors to be learned by GAN.…”

Section: Feature Generation a Speech Acoustic Featuresmentioning

confidence: 99%

“…Moreover we also evaluate the sample size with respect to prediction accuracy. To be able to compre to previous works, we adopt a Deep Convolutional Neural Network (DCNN) as depression severity prediction model, its hyperparameters are set as in [45] including the number of layers and nodes. When training the model, we adopted an early-stop strategy.…”

Section: Depression Severity Predictionmentioning

confidence: 99%

See 2 more Smart Citations

Feature Augmenting Networks for Improving Depression Severity Estimation From Speech Signals

2020

Self Cite

View full text Add to dashboard Cite

Depression disorder has become one of the major psychological diseases endangering human health. Researcher in the affective computing community is supporting the development of reliable depression severity estimation system, from multiple modalities (speech, face, text), to assist doctors in their diagnosis. However, the limited amount of annotated data has become the main bottleneck restricting the study on depression screening, especially when deep learning models are used. To alleviate this issue, in this work we propose to use Deep Convolutional Generative Adversarial Network (DCGAN) for features augmentation to improve depression severity estimation from speech. To the best of our knowledge, this approach is the first attempt to apply the Generative Adversarial Network for depression severity estimation from speech. Besides, to measure the quality of the augmented features, we propose three different measurement criteria, characterizing the spatial, frequency and representation learning of the augmented features. Finally, the augmented features are used to train depression estimation models. Experiments are carried out on speech signals from the Audio Visual Emotion Challenge (AVEC2016) depression dataset, and the relationship between the model performance and data size is explored. Our experimental results show that: 1) The combination of the three proposed evaluation criteria can effectively and comprehensively evaluate the quality of the augmented features. 2) When increasing the size of the augmented data, the performance of depression severity estimation gradually improves and the model converges to a certain stable state. 3) The proposed DCGAN based data augmentation approach effectively improves the performance of depression severity estimation, with the root mean square error (RMSE) reduced to 5.520 and mean absolute error (MAE) reduced to 4.634, which is better than most of the state of the art results on AVEC 2016. INDEX TERMS Depression estimation, audio features, data augmentation, deep convolutional generative adversarial network, spatial domain, frequency domain, deep learning aspect. HICHEM SAHLI is currently a Professor in computer vision and machine learning with the Department of Electronics and Informatics (ETRO), Vrije Universiteit Brussel (VUB), and a Group-Coordinator with the Interuniversity Microelectronics Centre (IMEC). He coordinates the Joint VUB-NPU AudioVisual Signal Processing (AVSP) Laboratory. He has authored or coauthored over 310 refereed journal and conference papers. His research interests include theoretical and applied problems related to computer vision, machine learning, and signal, audio, and image processing, for applications linked to affective computing, multimodal interaction, and behavior analysis.

show abstract

Section: Feature Generation a Speech Acoustic Featuresmentioning

confidence: 99%

Section: Feature Generation a Speech Acoustic Featuresmentioning

confidence: 99%

Section: Depression Severity Predictionmentioning

confidence: 99%

See 1 more Smart Citation

Feature Augmenting Networks for Improving Depression Severity Estimation From Speech Signals

2020

Self Cite

View full text Add to dashboard Cite

show abstract

“…Most current vision-based approaches to automatic depression analysis [17], [18], [19], [20], [21] base their prediction on the non-verbal facial behaviours of participants during an interview. There remain several challenges to achieve actionable results in this scenario, and our proposed approach mainly focus on addressing three of them.…”

Section: Introductionmentioning

confidence: 99%

Spectral Representation of Behaviour Primitives for Depression Analysis

Song

Jaiswal

Shen

et al. 2022

IEEE Trans. Affective Comput.

View full text Add to dashboard Cite

Depression is a serious mental disorder affecting millions of people all over the world. Traditional clinical diagnosis methods are subjective, complicated and require extensive participation of clinicians. Recent advances in automatic depression analysis systems promise a future where these shortcomings are addressed by objective, repeatable, and readily available diagnostic tools to aid health professionals in their work. Yet there remain a number of barriers to the development of such tools. One barrier is that existing automatic depression analysis algorithms base their predictions on very brief sequential segments, sometimes as little as one frame. Another barrier is that existing methods do not take into account what the context of the measured behaviour is. In this paper, we extract multi-scale video-level features for video-based automatic depression analysis. We propose to use automatically detected human behaviour primitives as the low-dimensional descriptor for each frame. We also propose two novel spectral representations, i.e. spectral heatmaps and spectral vectors, to represent video-level multi-scale temporal dynamics of expressive behaviour. Constructed spectral representations are fed to Convolution Neural Networks (CNNs) and Artificial Neural Networks (ANNs) for depression analysis. We conducted experiments on the AVEC 2013 and AVEC 2014 benchmark datasets to investigate the influence of interview tasks on depression analysis. In addition to achieving state of the art accuracy in severity of depression estimation, we show that the task conducted by the user matters, that fusion of a combination of tasks reaches highest accuracy, and that longer tasks are more informative than shorter tasks, up to a point.

show abstract

“…Tested on AVEC2014 [19], the method could predict with over 80% accuracy depressive behavior. Yang et al [25] proposed a system for hybridizing deep and shallow models for depression prediction from audio, video, and text descriptors. Researchers employed a DCNN–DNN model for audio-visual multimodal depression recognition using the PHQ-8 framework, and a paragraph vector (PV) analyzed the interview transcript using an SVM-based model in order to infer the physical and mental conditions of the subject.…”

Section: Introductionmentioning

confidence: 99%

Predicting Depression, Anxiety, and Stress Levels from Videos Using the Facial Action Coding System

Gavrilescu

Vizireanu

2019

Sensors

104

View full text Add to dashboard Cite

We present the first study in the literature that has aimed to determine Depression Anxiety Stress Scale (DASS) levels by analyzing facial expressions using Facial Action Coding System (FACS) by means of a unique noninvasive architecture on three layers designed to offer high accuracy and fast convergence: in the first layer, Active Appearance Models (AAM) and a set of multiclass Support Vector Machines (SVM) are used for Action Unit (AU) classification; in the second layer, a matrix is built containing the AUs’ intensity levels; and in the third layer, an optimal feedforward neural network (FFNN) analyzes the matrix from the second layer in a pattern recognition task, predicting the DASS levels. We obtained 87.2% accuracy for depression, 77.9% for anxiety, and 90.2% for stress. The average prediction time was 64 s, and the architecture could be used in real time, allowing health practitioners to evaluate the evolution of DASS levels over time. The architecture could discriminate with 93% accuracy between healthy subjects and those affected by Major Depressive Disorder (MDD) or Post-traumatic Stress Disorder (PTSD), and 85% for Generalized Anxiety Disorder (GAD). For the first time in the literature, we determined a set of correlations between DASS, induced emotions, and FACS, which led to an increase in accuracy of 5%. When tested on AVEC 2014 and ANUStressDB, the method offered 5% higher accuracy, sensitivity, and specificity compared to other state-of-the-art methods.

show abstract

Integrating Deep and Shallow Models for Multi-Modal Depression Analysis—Hybrid Architectures

Cited by 78 publications

References 67 publications

Feature Augmenting Networks for Improving Depression Severity Estimation From Speech Signals

Feature Augmenting Networks for Improving Depression Severity Estimation From Speech Signals

Spectral Representation of Behaviour Primitives for Depression Analysis

Predicting Depression, Anxiety, and Stress Levels from Videos Using the Facial Action Coding System

Contact Info

Product

Resources

About