Multimodal emotion recognition is a research area that uses signals of different natures, such as facial expressions, speech, gestures, and physiological signals, to recognize emotions more accurately than any single modality can alone. The field has gained importance because of its potential applications in decision-making, human recognition, and social interaction. One of its significant challenges is feature extraction: identifying, in each signal, the features that actually carry emotional information. Various techniques have been developed for this purpose, including machine-learning-based methods such as deep learning, as well as feature fusion and feature selection. These techniques aim to extract the most relevant and discriminative features from the signals and thereby improve the accuracy of emotion recognition.
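To make the feature fusion and feature selection steps concrete, the following is a minimal sketch in Python, assuming pre-extracted per-sample feature vectors for two modalities. The synthetic arrays, the feature dimensions, the selection size k=32, and the SVM classifier are illustrative assumptions, not a specific method from the literature surveyed here.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-ins for pre-extracted per-sample features,
# e.g. audio (prosodic/spectral) and facial (landmark/appearance) descriptors.
rng = np.random.default_rng(0)
n_samples = 400
audio_feats = rng.normal(size=(n_samples, 64))    # placeholder audio features
facial_feats = rng.normal(size=(n_samples, 128))  # placeholder facial features
labels = rng.integers(0, 4, size=n_samples)       # four emotion classes (illustrative)

# Feature-level (early) fusion: concatenate the modality features per sample.
fused = np.concatenate([audio_feats, facial_feats], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    fused, labels, test_size=0.25, random_state=0, stratify=labels
)

# Feature selection (keep the k most discriminative fused features),
# followed by a standard classifier.
clf = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=32),
    SVC(kernel="rbf"),
)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```

Concatenation is the simplest form of feature fusion; in practice the per-modality feature vectors would come from the modality-specific extractors (for example, deep networks) referred to above, and the selection step discards dimensions that carry little emotional information.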