2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2018
DOI: 10.1109/iros.2018.8593571
On the Robustness of Speech Emotion Recognition for Human-Robot Interaction with Deep Neural Networks

Abstract: Speech emotion recognition (SER) is an important aspect of effective human-robot collaboration and has received considerable attention from the research community. For example, many neural network-based architectures have been proposed recently and have pushed performance to a new level. However, the applicability of such neural SER models, trained only on in-domain data, to noisy conditions is currently under-researched. In this work, we evaluate the robustness of state-of-the-art neural acoustic emotion recognition models in…

Cited by 43 publications (27 citation statements)
References 20 publications
“…SER models trained on a single corpus tend to overfit, leading to poor performance on out-of-domain data, as presented in [3]. To address this issue, several techniques have been proposed: (1) the data augmentation approach, which consists of generating additional training samples by duplicating and often modifying the original training set, using techniques such as vocal tract length perturbation [22] or variation of tempo, loudness and background noise [23]; (2) multi-task learning, in which the models are trained on additional tasks, such as gender or domain identification [24,25]; (3) the transfer learning approach, in which the models are first trained on a given domain and then adapted to the task at hand [26,27]; and (4) cross-modal transfer, in which an image-based emotion recognition model is used to improve SER [28].…”
Section: Related Work
confidence: 99%
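Two of the augmentation techniques listed in the citation above, loudness variation and added background noise, can be illustrated at the waveform level. This is a minimal sketch, not the cited papers' code: the function name, gain range, and SNR range are illustrative assumptions, and the input is assumed to be a mono signal as a float numpy array.

```python
import numpy as np

def augment(signal: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a randomly perturbed copy of the signal; the emotion label is unchanged."""
    out = signal.copy()
    # Loudness variation: random gain between -6 dB and +6 dB (assumed range).
    gain_db = rng.uniform(-6.0, 6.0)
    out *= 10.0 ** (gain_db / 20.0)
    # Additive white noise at a random signal-to-noise ratio of 10-30 dB (assumed range).
    snr_db = rng.uniform(10.0, 30.0)
    noise = rng.standard_normal(out.shape)
    sig_power = np.mean(out ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(sig_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return out + scale * noise
```

Because the perturbation leaves the target label intact, the augmented copy can be added to the training set alongside the original sample.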
“…Previous research in end-to-end speech recognition demonstrated the importance of introducing random perturbations into the speech signal, such as changes in pitch, tempo, and loudness, and added noise [4], [25], [27], [28]. Since such perturbations do not alter the target label (the spoken text in the case of speech recognition, or an emotion category), they can be conveniently applied with some occurrence probability during training.…”
Section: B. Data Augmentation
confidence: 99%
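The "occurrence probability" idea above can be sketched as a wrapper that fires with probability p on each training example. The tempo perturbation here is a crude linear-interpolation resample used only for illustration; a real pipeline would use a proper time-stretch (e.g. via sox or librosa), and the probability and rate ranges are assumptions.

```python
import numpy as np

def maybe_augment(signal: np.ndarray, rng: np.random.Generator, p: float = 0.5) -> np.ndarray:
    """With probability p, return a tempo-perturbed copy; otherwise pass through."""
    if rng.uniform() < p:
        # Tempo-like perturbation: resample to a random rate in [0.9, 1.1]
        # via linear interpolation (illustrative, not pitch-preserving).
        rate = rng.uniform(0.9, 1.1)
        n = int(len(signal) / rate)
        idx = np.linspace(0.0, len(signal) - 1, n)
        return np.interp(idx, np.arange(len(signal)), signal)
    return signal
```

Applying the wrapper per example each epoch means the model sees a different mix of clean and perturbed signals every pass, which is what makes the scheme convenient during training.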
“…We use only 4 of the 13 speakers. In our previous work [2], [4], we used the iCub robotic head to test the robustness of our models against the robot's ego noise. In this paper, however, we use the Soundman wooden head to focus on background noise generated by the projectors, computers, air conditioner, and power sources, as well as noise from airplanes frequently passing nearby, and reverberation.…”
Section: B. Human-Robot Simulation
confidence: 99%
“…Here, HRI is realized with a robotic humanoid head that provides the required emotional feedback. The work in [43], [44], [45] shows that hybrid deep learning architectures can combine the essential characteristics of RNNs and CNNs by stacking convolutional layers with recurrent layers. This allows the method to capture both temporal and frequency dependencies in a given speech signal.…”
Section: Features of Voice Characteristics
confidence: 99%
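The hybrid CNN-RNN idea in the citation above can be sketched in plain numpy: a convolution over the frequency axis of a spectrogram frame captures local spectral structure, and a recurrence over time captures temporal structure. All shapes, sizes, and weight scales here are illustrative assumptions, not the architecture from [43]-[45].

```python
import numpy as np

rng = np.random.default_rng(0)
T, F = 50, 40                       # time frames x frequency bins (assumed sizes)
spec = rng.standard_normal((T, F))  # stand-in for a log-mel spectrogram

# "CNN" stage: one 1-D convolution along the frequency axis of each frame.
kernel = rng.standard_normal(5) / 5.0
conv = np.stack([np.convolve(frame, kernel, mode="valid") for frame in spec])

# "RNN" stage: a simple tanh recurrence over time on the conv features.
H = 16
W_in = rng.standard_normal((conv.shape[1], H)) * 0.1
W_h = rng.standard_normal((H, H)) * 0.1
h = np.zeros(H)
for t in range(T):
    h = np.tanh(conv[t] @ W_in + h @ W_h)
# h now summarizes both frequency (conv) and temporal (recurrence) structure
# and could feed a final emotion classifier layer.
```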