“…Human emotions are expressed in, and can accordingly be identified from, different modalities, such as speech, gestures, and facial expressions [4,5,6]. Consequently, the research community has long sought efficient ways to utilise multimodal information in an attempt to improve recognition performance and arrive at a more holistic understanding of human behaviour and communication [7,8]. A plethora of such works have investigated different ways to improve AER by combining several information streams, e.g., audio, video, text, gestures, and physiological signals.…”