“…The first work in this direction relied on phone strings to represent the speech (Roy & Pentland, 2002; Roy, 2003), but more recently this learning has been shown to be possible directly on the speech signal (Synnaeve et al., 2014; Harwath & Glass, 2015; Harwath et al., 2016). Subsequent work on visually-grounded models of speech has investigated improvements and alternatives to the modeling or training algorithms (Leidal et al., 2017; Kamper et al., 2017c; Havard et al., 2019a; Merkx et al., 2019; Scharenborg et al., 2018a; Ilharco et al., 2019; Eloff et al., 2019a), application to multilingual settings (Harwath et al., 2018a; Kamper & Roth, 2017; Azuh et al., 2019; Havard et al., 2019a), analysis of the linguistic abstractions, such as words and phones, learned by the models (Harwath et al., 2018b; Drexler & Glass, 2017; Havard et al., 2019b), and the impact of jointly training with textual input (Holzenberger et al., 2019; Chrupała, 2019; Pasad et al., 2019). Representations learned by models of visually grounded speech are also well-suited for transfer learning to supervised tasks, being highly robust to noise and domain shift.…”