2024
DOI: 10.1016/j.eswa.2023.122946
MSER: Multimodal speech emotion recognition using cross-attention with deep fusion

Mustaqeem Khan,
Wail Gueaieb,
Abdulmotaleb El Saddik
et al.

Cited by 22 publications (3 citation statements)
References 22 publications
“…The widely used "scaled dot-product attention" [53] transforms three matrices of queries (Q), keys (K), and values (V) into an output vector. It computes the dot product of each query Q with all keys K, divides the result by √d_k (where d_k is the dimension of the queries), and finally applies a softmax function (f_softmax) to obtain the weights on the values V. The attention can be computed as shown in Equation (15).…”
Section: Cross-attention Fusion Of Encoded Cuesmentioning
confidence: 99%
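The attention mechanism described in the statement above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard formula softmax(QK^T/√d_k)V, not the cited paper's implementation; the function name and the toy matrices are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D query/key/value matrices."""
    d_k = Q.shape[-1]
    # Dot product of each query with all keys, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable row-wise softmax tunes the weights on the values
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy example: two queries attending over two key/value pairs
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0]])
out = scaled_dot_product_attention(Q, K, V)
```

Each output row is a convex combination of the value rows, with weights determined by query-key similarity, so every output entry stays within the range spanned by the corresponding value column.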
“…Liu et al. incorporate EEG, peripheral physiological signals, electrocardiogram, and movement features, and extend deep canonical correlation analysis and a bimodal autoencoder for multimodal signal fusion [13]. Research on emotion recognition has also been devoted to fusing acoustic, linguistic, and textual information [14], including speech-text dual-modal frameworks [15,16], audio-visual-text tri-modal frameworks [17], and multimodal frameworks using audio, text, facial expressions, and hand movements [18].…”
Section: Introductionmentioning
confidence: 99%
“…However, this method inevitably incurs large computational and memory costs. In recent years, deep learning methodologies, particularly deep convolutional neural networks (CNNs), have demonstrated exceptional performance across various domains, including image restoration [58] and speech signal processing [59]. The fusion of deep learning techniques with sparse SAR imaging is a promising avenue for future research.…”
Section: Introductionmentioning
confidence: 99%