“…In terms of input, some models operate on linguistic abstractions of speech, such as phonemic, phonetic, or orthographic transcripts (e.g., Frank et al., 2010; Goldwater et al., 2009; Nikolaus and Fourtassi, 2021); phonetic or lexical representations derived using pre-trained automatic speech recognition systems (e.g., Fourtassi and Dupoux, 2014; Roy, 2005; Salvi et al., 2012); or simplified representations of acoustic speech, such as formant frequencies of pre-segmented vowels (Coen, 2006; de Boer and Kuhl, 2003). Another set of models operates directly on real continuous speech (e.g., Kamper et al., 2016; Nixon, 2020; Park and Glass, 2008; Schatz et al., 2021; Shain and Elsner, 2020). Besides models that process language input only, there are models that use concurrent visual input in addition to spoken language (e.g., Alishahi et al., 2017; Chrupała et al., 2017; Coen, 2006; Harwath et al., 2019; Harwath et al., 2016; Khorrami and Räsänen, 2021; Nikolaus and Fourtassi, 2021; Roy, 2005).…”