ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9054621
Deep Encoded Linguistic and Acoustic Cues for Attention Based End to End Speech Emotion Recognition

Cited by 20 publications (8 citation statements) · References 14 publications
“…This development can be attributed to the fast pace of solutions to visual task recognition [14]. In [15], the authors introduced a comprehensive model incorporating convolutional layers and a multi-head self-attention mechanism, which utilized deep encoded linguistic information and audio spectrogram representations to perform emotion recognition in speech. To address the class imbalance in their dataset, they also carried out down-sampling and ensembling, further improving SER accuracy.…”
Section: Related Work
confidence: 99%
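The conv + multi-head self-attention pipeline described in [15] can be sketched roughly as below. This is a minimal illustration, not the authors' exact architecture: the layer sizes, number of heads, four-class output, and mean-pooling readout are all assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class ConvSelfAttentionSER(nn.Module):
    """Illustrative SER model: a conv feature extractor over a mel
    spectrogram, followed by multi-head self-attention over time and
    a pooled linear classifier. Hyperparameters are assumptions."""

    def __init__(self, n_mels=64, d_model=128, n_heads=4, n_classes=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fc = nn.Linear(d_model, n_classes)

    def forward(self, spec):           # spec: (batch, n_mels, time)
        h = self.conv(spec)            # (batch, d_model, time)
        h = h.transpose(1, 2)          # (batch, time, d_model)
        h, _ = self.attn(h, h, h)      # self-attention across time steps
        return self.fc(h.mean(dim=1))  # mean-pool over time, then classify

model = ConvSelfAttentionSER()
logits = model(torch.randn(2, 64, 100))  # 2 utterances, 100 frames each
```

Down-sampling and ensembling, as mentioned in the statement, would sit outside this module, in the data pipeline and at prediction time respectively.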
“…Therefore, it might be beneficial to allow attention mechanisms to combine different representation subspaces using queries, keys, and values. Bhosale et al [22] were the first to combine an attention mechanism with an RNN. They calculated alignment-probability matrices for input and output series in an encoder-decoder model, effectively solving machine translation issues.…”
Section: Related Work
confidence: 99%
“…In essence, given that the query and key matrices Q and K have dimension d_k and the value matrix V has dimension d_v, the operation executes the matrix multiplication of Q with each K, divides the product by √d_k, and thereafter applies the softmax function to obtain the weights. The output matrix is articulated as follows [22]:…”
confidence: 99%
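The operation described above is standard scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal NumPy sketch, with matrix dimensions chosen purely for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, softmax taken over the key axis."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))    # 3 queries of dimension d_k = 8
K = rng.normal(size=(5, 8))    # 5 keys of dimension d_k = 8
V = rng.normal(size=(5, 16))   # 5 values of dimension d_v = 16
out = scaled_dot_product_attention(Q, K, V)  # shape (3, 16)
```

The √d_k divisor keeps the dot products from growing with d_k, which would otherwise push the softmax into regions with vanishing gradients.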
“…Research on the transfer of Speech Emotion Recognition (SER) models to the MER regression task [10] has shown successful transfer learning from recognition of emotions in English speech to classical music. 1 In the case of SER, linguistic research has gained more importance lately: using a bag-of-audio-words approach to exploit linguistic features [14], combining speech-based and linguistic classifiers [15], or using linguistic and acoustic cues for end-to-end models [16]. The linguistic approach to emotion recognition is motivated mainly by the fact that emotion adjectives tend to have diverse meanings across cultures [17] and that translation of words might be questionable [8].…”
Section: Related Work
confidence: 99%