Self-attention fusion for audiovisual emotion recognition with incomplete data

Chumachenko, Kateryna; Iosifidis, Alexandros; Gabbouj, Moncef

doi:10.1109/icpr56361.2022.9956592

Cited by 32 publications

(3 citation statements)

References 13 publications


“…The Softmax activation promotes competition in the attention matrix [43], thus highlighting more important attributes and timestamps of each modality. As a result, it provides the importance score of each key relative to each query, that is, the importance of each representation of modality α with respect to modality β. Consequently, the features that exhibit agreement between the two modalities exert the greatest influence on the final prediction, thereby guiding the model to learn features that demonstrate a substantial level of agreement across modalities.…”

Section: Non-invasive Modal Fusion Transformer Encoder (mentioning)

confidence: 99%

Transferable non-invasive modal fusion-transformer (NIMFT) for end-to-end hand gesture recognition

Xu, Zhao, et al. 2024

J. Neural Eng.


Objective. Recent studies have shown that integrating IMU signals with surface electromyographic (sEMG) signals can greatly improve hand gesture recognition (HGR) performance in applications such as prosthetic control and rehabilitation training. However, current deep learning models for multimodal HGR encounter difficulties in invasive modal fusion, complex feature extraction from heterogeneous signals, and limited inter-subject model generalization. To address these challenges, this study aims to develop an end-to-end and inter-subject transferable model that utilizes non-invasively fused sEMG and acceleration (ACC) data. Approach. The proposed NIMFT model utilizes 1D-CNN-based patch embedding for local information extraction and employs a multi-head cross-attention (MCA) mechanism to non-invasively integrate sEMG and ACC signals, stabilizing the variability induced by sEMG. The proposed architecture undergoes detailed ablation studies after hyperparameter tuning. Transfer learning is employed by fine-tuning a pre-trained model on a new subject, and a comparative analysis is performed between the fine-tuned and subject-specific models. Additionally, the performance of NIMFT is compared to state-of-the-art fusion models. Main results. The NIMFT model achieved recognition accuracies of 93.91%, 91.02%, and 95.56% on the three action sets in the Ninapro DB2 dataset. The proposed embedding method and MCA outperformed the traditional invasive modal fusion transformer by 2.01% (embedding) and 1.23% (fusion), respectively. In comparison to subject-specific models, the fine-tuned model exhibited the highest average accuracy improvement of 2.26%, achieving a final accuracy of 96.13%. Moreover, the NIMFT model demonstrated superiority in terms of accuracy, recall, precision, and F1-score compared to the latest modal fusion models of similar model scale. Significance. NIMFT is a novel end-to-end HGR model that utilizes a non-invasive MCA mechanism to integrate long-range intermodal information effectively. Compared to recent modal fusion models, it demonstrates superior performance in inter-subject experiments and, through transfer learning, offers higher training efficiency and accuracy than subject-specific approaches.
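The citation statement for this entry describes softmax-normalized attention in which each query from one modality assigns an importance score to each key from the other modality. The following is a minimal single-head sketch of that idea in plain Python, not the NIMFT implementation. The function names, the choice of which modality supplies the queries, the scaling by the square root of the feature dimension, and the toy feature vectors are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Normalize a list of scores into weights that sum to 1."""
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention across two modalities.

    queries: feature vectors from modality alpha (assumed)
    keys, values: feature vectors from modality beta (assumed)
    Returns the fused features and the attention weights.
    """
    d_k = len(keys[0])
    weights = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        # Softmax makes the keys compete for each query's attention.
        weights.append(softmax(scores))
    n_features = len(values[0])
    fused = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(n_features)]
        for row in weights
    ]
    return fused, weights


# Hypothetical toy features: 2 time steps of alpha, 3 time steps of beta.
alpha = [[1.0, 0.0], [0.0, 1.0]]
beta = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = cross_attention(alpha, beta, beta)
```

Each row of `weights` sums to 1, and keys that align with a query receive larger weights than keys that do not, which is the agreement effect the statement refers to.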


“…Furthermore, in addition to these brain-inspired methods, we compare the performance of MulT (Chumachenko, Iosifidis, and Gabbouj 2022) with our model specifically in the RAVDESS dataset. The accuracy achieved by MulT on seven classes is 74.16%, and our model outperforms MulT by 25.47%.…”

Section: Overall Performance (mentioning)

confidence: 99%

ND-MRM: Neuronal Diversity Inspired Multisensory Recognition Model

Wang, Fan, Jia, et al. 2024

AAAI


Cross-sensory interaction is a key aspect of multisensory recognition. Without cross-sensory interaction, artificial neural networks show inferior performance in multisensory recognition. In contrast, the human brain has an inherently remarkable ability in multisensory recognition, which stems from the diverse neurons that exhibit distinct responses to sensory inputs, especially the multisensory neurons whose multisensory responses enable cross-sensory interaction. Based on this neuronal diversity, we propose a Neuronal Diversity inspired Multisensory Recognition Model (ND-MRM), which, similar to the brain, comprises unisensory neurons and multisensory neurons. To reflect the different response characteristics of diverse neurons in the brain, special connection constraints are designed to regulate feature transmission in the ND-MRM. Leveraging this novel concept of neuronal diversity, our model is biologically plausible, enabling more effective recognition of multisensory information. To validate the performance of the proposed ND-MRM, we employ a multisensory emotion recognition task as a case study. The results demonstrate that our model surpasses state-of-the-art brain-inspired baselines on two datasets, proving the potential of brain-inspired methods for advancing multisensory interaction and recognition.


“…Recently, several multimodal deep-learning models have been designed to be trained concurrently with data of multiple modalities, such as vision, auditory, and sensor data. In particular, many studies [7][8][9][10][11] have proposed training speech-recognition models using various forms of data such as audio and text. Trained multimodal deep-learning models can train from diverse information and achieve a high prediction accuracy.…”

Section: Multimodal Deep Learning (mentioning)

confidence: 99%

Multimodal Prompt Learning in Emotion Recognition Using Context and Audio Information

2023


Prompt learning has improved the performance of language models by reducing the gap between the training methods used for pre-training and for downstream tasks. However, extending prompt learning in language models pre-trained with unimodal data to multimodal sources is difficult as it requires additional deep-learning layers that cannot be attached. In the natural-language emotion-recognition task, improved emotional classification can be expected when using audio and text to train a model rather than only natural-language text. Audio information, such as voice pitch, tone, and intonation, can provide information that is unavailable in text and so help predict emotions more effectively. Thus, using both audio and text can enable better emotion prediction in speech emotion-recognition models compared to semantic information alone. In this paper, in contrast to existing studies that use multimodal data with an additional layer, we propose a method for improving the performance of speech emotion recognition using multimodal prompt learning with text-based pre-trained models. The proposed method uses text and audio information in prompt learning by employing a language model pre-trained on natural-language text. In addition, we propose a method to improve the emotion-recognition performance of the current utterance using the emotion and contextual information of the previous utterances for prompt learning in speech emotion-recognition tasks. The performance of the proposed method was evaluated using the English multimodal dataset MELD and the Korean multimodal dataset KEMDy20. Experiments using both the proposed methods obtained an accuracy of 87.49%, F1 score of 44.16, and weighted F1 score of 86.28.
