“…Multimodality has been studied for around three decades in the context of social semiotics, and its potential to help us understand the world around us led AI researchers to build models that can process information from multiple modalities through machine learning and social signal processing (Vinciarelli, Pantic, & Bourlard, 2009). The literature on multimodal prediction is rich with examples of audio-visual speech recognition (Zhou & De la Torre, 2012), multimedia content indexing and retrieval (Atrey, Hossain, El Saddik, & Kankanhalli, 2010), and multimodal affect recognition (D'Mello & Kory, 2015; Grawemeyer et al., 2017). Learning from multimodal data provides opportunities to gain an in-depth understanding of complex processes, and for AI research to make progress it should focus on multimodal AI models that can process and relate information from multiple modalities (Baltrušaitis, Ahuja, & Morency, 2019).…”