Expert systems are being used extensively to make critical decisions involving emotional analysis in affective computing. The evolution of deep learning algorithms has improved the potential for extracting value from multimodal emotional data. However, these black-box algorithms rarely explain how they process input features to arrive at their outputs. This study focuses on the risks of using black-box deep learning models for critical tasks such as emotion recognition, and describes why human-understandable interpretations of how these models work are essential. The study uses one of the largest multimodal datasets available, CMU-MOSEI. Many researchers have combined the pre-extracted features provided by the CMU Multimodal SDK with black-box deep learning models, making it difficult to interpret the contribution of individual features. This study examines the implications of significant features from the audio, video, and text modalities, identified using XAI, for context-aware multimodal emotion recognition. It describes the process of curating reduced-feature models with the Gradient SHAP XAI method. These reduced models, built from the most highly contributing features, achieve results comparable to, and at times better than, their corresponding all-feature models as well as the baseline GraphMFN model. The study shows that carefully selecting significant features filters out irrelevant ones and attenuates the noise or bias they introduce, improving performance while making the resulting expert systems transparent, easily interpretable, and trustworthy.
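To make the feature-selection step concrete, the sketch below shows how Gradient SHAP attributions could be used to rank pre-extracted features and keep only the top contributors. It uses Captum's GradientShap implementation on a toy classifier; the model architecture, feature split, emotion index, and top-k value are illustrative assumptions rather than the study's actual configuration.

# Hypothetical sketch: ranking pre-extracted CMU-MOSEI-style features with
# Captum's GradientShap and keeping only the top-k contributors.
# Model, tensor shapes, and the per-modality split are assumptions.
import torch
import torch.nn as nn
from captum.attr import GradientShap

# Toy emotion classifier over concatenated (audio + video + text) feature vectors.
N_FEATURES, N_EMOTIONS = 409, 6   # e.g. 74 audio + 35 video + 300 text (assumed split)
model = nn.Sequential(nn.Linear(N_FEATURES, 128), nn.ReLU(), nn.Linear(128, N_EMOTIONS))
model.eval()

inputs = torch.randn(32, N_FEATURES)      # a batch of utterance-level feature vectors
baselines = torch.zeros(50, N_FEATURES)   # reference distribution required by Gradient SHAP

gs = GradientShap(model)
# Attribute the logit of one emotion class (index 0 here, purely illustrative).
attributions = gs.attribute(inputs, baselines=baselines, n_samples=20, target=0)

# Global importance = mean absolute attribution per feature across the batch.
importance = attributions.abs().mean(dim=0)
top_k = 50                                # size of the reduced feature set (assumed)
keep_idx = importance.topk(top_k).indices
print("Most contributing feature indices:", keep_idx.tolist())
# A "reduced model" would then be retrained on inputs[:, keep_idx] only.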