“…Another set of models operate directly on real continuous speech (e.g., Kamper et al, 2016;Nixon, 2020;Park and Glass, 2008;Schatz et al, 2021;Shain and Elsner, 2020). Besides processing language input only, there are models that use visual concurrent input in addition to spoken language (e.g., Alishahi et al, 2017;Chrupa la et al, 2017;Coen, 2006;Harwath et al, 2019;Harwath et al, 2016;Khorrami and Räsänen, 2021;Nikolaus and Fourtassi, 2021;Roy, 2005). Besides passive perception approaches, there are also models that can interact with simulated or real human caregivers (e.g., Howard and Messum, 2011;Rasilo and Räsänen, 2017) and studies using multiple computational agents that can interact with each other using some communicative means (e.g., Kirby, 2001;Moulin-Frier et al, 2015;Oudeyer, 2005; see also Oudeyer et al, 2019, for a recent review).…”