In this paper, we consider an approach that can significantly increase the accuracy of facial emotion recognition by adapting the model to the emotions of a particular user (e.g., the smartphone owner). At the first stage, a neural network model, previously trained to recognize facial expressions in static photos, is used to extract visual features of the face in each frame. Next, the facial features of the video frames are aggregated into a single descriptor for a short video fragment, and a neural network classifier is trained on these descriptors. At the second stage, it is proposed that this classifier be adapted (fine-tuned) using a small set of videos with the facial expressions of a particular user. After emotion classification, the user can correct the predicted emotions to further improve the accuracy of the personal model. An experimental study on the RAVDESS dataset shows that the approach with model adaptation to a specific user can significantly (by up to 20–50%) improve the accuracy of facial expression recognition in video.
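The two-stage scheme described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: synthetic Gaussian vectors stand in for the CNN face features, mean pooling is assumed as the frame-aggregation step, a linear softmax model stands in for the neural-network classifier, and a user-specific offset of the descriptors mimics the appearance of one particular user. Only the eight-class setting matches RAVDESS; all other names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

N_EMOTIONS = 8   # RAVDESS distinguishes 8 emotion classes
FEAT_DIM = 16    # stand-in for the dimensionality of the CNN face features

def aggregate(frame_features):
    """Aggregate per-frame CNN features of a short video fragment into a
    single descriptor (mean pooling here; the paper's exact scheme may differ)."""
    return np.asarray(frame_features).mean(axis=0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class EmotionClassifier:
    """Minimal linear softmax classifier standing in for the neural-network
    classifier trained on video descriptors."""
    def __init__(self):
        self.W = np.zeros((FEAT_DIM, N_EMOTIONS))
        self.b = np.zeros(N_EMOTIONS)

    def fit(self, X, y, epochs=300, lr=0.5):
        Y = np.eye(N_EMOTIONS)[y]
        for _ in range(epochs):
            P = softmax(X @ self.W + self.b)
            self.W -= lr * X.T @ (P - Y) / len(X)
            self.b -= lr * (P - Y).mean(axis=0)

    def predict(self, X):
        return (X @ self.W + self.b).argmax(axis=1)

# Synthetic descriptors: each emotion class has its own mean; videos of one
# particular user are shifted by a user-specific offset (mimicking personal
# appearance and expression style).
class_means = rng.normal(size=(N_EMOTIONS, FEAT_DIM)) * 3.0
user_shift = rng.normal(size=FEAT_DIM) * 4.0

def sample(n, shift=0.0):
    y = rng.integers(0, N_EMOTIONS, size=n)
    X = class_means[y] + shift + rng.normal(size=(n, FEAT_DIM))
    return X, y

# Stage 1: train the classifier on a generic corpus of video descriptors.
X_gen, y_gen = sample(400)
clf = EmotionClassifier()
clf.fit(X_gen, y_gen)

# Stage 2: fine-tune on a small labeled set from one user, then evaluate
# on held-out videos of the same user.
X_user, y_user = sample(40, user_shift)
X_test, y_test = sample(200, user_shift)

acc_base = (clf.predict(X_test) == y_test).mean()
clf.fit(X_user, y_user, epochs=200, lr=0.2)   # fine-tuning pass
acc_ft = (clf.predict(X_test) == y_test).mean()
```

In this toy setting the fine-tuned model recovers most of the accuracy lost to the user-specific shift, which is the qualitative effect the paper reports on RAVDESS.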