“…The most common multimodal settings combined landmarks, body/head pose, or visual cues with past utterance transcriptions (Chu et al., 2018; Hua et al., 2019; Ueno et al., 2020), acoustic features (Türker et al., 2018; Ahuja et al., 2019; Ueno et al., 2020; Goswami et al., 2020; Woo et al., 2021; Jain and Leekha, 2021; Murray et al., 2021; Ben-Youssef et al., 2021), speaker metadata (Raman et al., 2021), or with combinations of the previous modalities (Ishii et al., 2020; Huang et al., 2020; Blache et al., 2020; Ishii et al., 2021; Boudin et al., 2021). The most common way to exploit different modalities together is simply to concatenate their embedded representations.…”
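For illustration, concatenation-based fusion can be sketched as below. This is a minimal PyTorch example, not the architecture of any cited work: the module name, the choice of encoders, and all dimensions (pose, audio, and text feature sizes) are hypothetical stand-ins for whatever per-modality representations a given system produces.

```python
import torch
import torch.nn as nn


class ConcatFusion(nn.Module):
    """Hypothetical sketch of multimodal fusion by concatenation.

    Each modality (e.g., pose landmarks, acoustic features, utterance
    embeddings) is projected into its own embedding space; the embeddings
    are then concatenated along the feature dimension and fed to a
    downstream predictor.
    """

    def __init__(self, pose_dim=136, audio_dim=40, text_dim=300,
                 embed_dim=128, num_classes=2):
        super().__init__()
        # One encoder per modality (dimensions are illustrative only).
        self.pose_enc = nn.Linear(pose_dim, embed_dim)
        self.audio_enc = nn.Linear(audio_dim, embed_dim)
        self.text_enc = nn.Linear(text_dim, embed_dim)
        # The prediction head sees the concatenation of all embeddings.
        self.head = nn.Linear(3 * embed_dim, num_classes)

    def forward(self, pose, audio, text):
        # Concatenate the per-modality embeddings on the feature axis.
        z = torch.cat([self.pose_enc(pose),
                       self.audio_enc(audio),
                       self.text_enc(text)], dim=-1)  # (batch, 3 * embed_dim)
        return self.head(z)


# Usage with random stand-in tensors for a batch of 4 examples.
model = ConcatFusion()
logits = model(torch.randn(4, 136), torch.randn(4, 40), torch.randn(4, 300))
print(logits.shape)  # torch.Size([4, 2])
```

Note that the fusion step itself is parameter-free: concatenation only juxtaposes the modality embeddings, leaving any cross-modal interaction to be learned by the downstream layers.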