“…Most studies focus on a single modality, such as facial expressions [6], [7], voice [8], [9], text [10], [11], body posture [12], gait [13], or physiological signals [14]. Multimodal processing and recognition have demonstrated great potential in numerous practical emotion recognition applications, for example combining speech and facial features [15], [16], electroencephalography (EEG) signals and eye gaze [17], audio and written text [18], [19], visual, audio, and text [20], [21], or modality-specific features together with contextual information [21], [22], [23].…”