Application of Speech Emotion Recognition in Intelligent Household Robot

Xu, Huahu; Gao, Jue; Jian, Yuan

doi:10.1109/aici.2010.118

Cited by 39 publications

(20 citation statements)

References 2 publications

“…Nowadays low cost devices can easily capture human emotion, which makes emotion recognition system more economically feasible for deployment. Hence automatic detection of user emotions has been applied to a variety of applications, including intelligent household robot for natural and friendly interaction with human beings [12] and fear type emotion recognition system dedicated to visual-audio surveillance [13].…”

Section: Emotion Recognition (mentioning)

confidence: 99%

Audiovisual Emotion Recognition Using Entropy-estimation-based Multimodal Information Fusion

Xie¹

2021

Preprint


Understanding human emotional states is indispensable for our daily interaction, and we can enjoy a more natural and friendly human-computer interaction (HCI) experience by fully utilizing human affective states. In the application of emotion recognition, multimodal information fusion is widely used to discover the relationships among multiple information sources and make joint use of a number of channels, such as speech, facial expression, gesture and physiological processes. This thesis proposes a new framework for emotion recognition using information fusion based on the estimation of information entropy. The novel techniques of information theoretic learning are applied to feature level fusion and score level fusion. The most critical issues for feature level fusion are feature transformation and dimensionality reduction. The existing methods depend on second order statistics, which are only optimal for Gaussian-like distributions. By incorporating information theoretic tools, a new feature level fusion method based on kernel entropy component analysis is proposed. For score level fusion, most previous methods focus on predefined rule based approaches, which are usually heuristic. In this thesis, a connection between information fusion and the maximum correntropy criterion is established for effective score level fusion. Feature level fusion and score level fusion methods are then combined to introduce a two-stage fusion platform. The proposed methods are applied to audiovisual emotion recognition, and their effectiveness is evaluated by experiments on two publicly available audiovisual emotion databases. The experimental results demonstrate that the proposed algorithms achieve improved performance in comparison with the existing methods.
The work of this thesis offers a promising direction for designing more advanced emotion recognition systems based on multimodal information fusion and has great significance for the development of intelligent human-computer interaction systems.

“…Speech is an important carrier of emotions in human communication. Speech Emotion Recognition (SER) has wide application perspectives on psychological assessment [1], robots [2], mobile services [3], etc. For example, a psychologist can design a treatment plan according to the emotions hidden/expressed in the patient's speech.…”

Section: Introduction (mentioning)

confidence: 99%

“…• With area attention and VTLP-based data augmentation, we achieved the state-of-the-art on the IEMOCAP dataset with an WA of 79.34% and UA of 77.54%.…”

Section: Introduction (mentioning)

confidence: 99%

Speech Emotion Recognition with Multiscale Area Attention and Data Augmentation

Zhang², Cui³, et al.

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


In Speech Emotion Recognition (SER), emotional characteristics often appear in diverse forms of energy patterns in spectrograms. Typical attention neural network classifiers of SER are usually optimized on a fixed attention granularity. In this paper, we apply multiscale area attention in a deep convolutional neural network to attend emotional characteristics with varied granularities and therefore the classifier can benefit from an ensemble of attentions with different scales. To deal with data sparsity, we conduct data augmentation with vocal tract length perturbation (VTLP) to improve the generalization capability of the classifier. Experiments are carried out on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. We achieved 79.34% weighted accuracy (WA) and 77.54% unweighted accuracy (UA), which, to the best of our knowledge, is the state of the art on this dataset.
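The abstract above reports both weighted accuracy (WA) and unweighted accuracy (UA) without defining them. The sketch below assumes the definitions commonly used in SER work: WA as overall accuracy across all utterances, and UA as the mean of per-class recall. The function names and the toy labels are hypothetical and for illustration only; they are not taken from the paper or from IEMOCAP.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    total = defaultdict(int)  # samples per true class
    hit = defaultdict(int)    # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            hit[t] += 1
    return sum(hit[c] / total[c] for c in total) / len(total)


# Toy, imbalanced example (made-up labels): 4 "neutral", 2 "angry".
y_true = ["neutral", "neutral", "neutral", "neutral", "angry", "angry"]
y_pred = ["neutral", "neutral", "neutral", "neutral", "angry", "neutral"]

wa = weighted_accuracy(y_true, y_pred)    # 5 of 6 samples correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 1.0 and 0.5
print(f"WA={wa:.4f} UA={ua:.4f}")
```

Under these assumed definitions, UA penalizes errors on rare classes more heavily than WA does, which is one reason papers on imbalanced emotion datasets tend to report both.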


“…Speech is an important carrier of human communication, and it makes sense to recognize emotions from speech. Speech Emotion Recognition(SER) has wide application prospects on psychological assessment [12], robots [13], mobile services [14], etc. For example, a psychologist formulates a treatment plan according to the emotions hidden in the patient's speech.…”

Section: Introduction (mentioning)

confidence: 99%

Head Fusion: Improving the Accuracy and Robustness of Speech Emotion Recognition on the IEMOCAP and RAVDESS Dataset

Zhang², Zhang³

2021

IEEE Access


Speech Emotion Recognition (SER) refers to the use of machines to recognize the emotions of a speaker from his (or her) speech. SER benefits Human-Computer Interaction (HCI). But there are still many problems in SER research, e.g., the lack of high-quality data, insufficient model accuracy, little research under noisy environments, etc. In this paper, we proposed a method called Head Fusion based on the multi-head attention mechanism to improve the accuracy of SER. We implemented an attention-based convolutional neural network (ACNN) model and conducted experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) data set. The accuracy is improved to 76.18% (weighted accuracy, WA) and 76.36% (unweighted accuracy, UA). To the best of our knowledge, compared with the state-of-the-art result on this dataset (76.4% of WA and 70.1% of UA), we achieved a UA improvement of about 6% absolute while achieving a similar WA. Furthermore, we conducted empirical experiments by injecting speech data with 50 types of common noises. We inject the noises by altering the noise intensity, time-shifting the noises, and mixing different noise types, to identify their varied impacts on the SER accuracy and verify the robustness of our model. This work will also help researchers and engineers properly augment their training data by using speech data with the appropriate types of noises to alleviate the problem of insufficient high-quality data.
