“…We mainly focused on linguistic analyses that are applicable to text‐only setups, because this enables us to better isolate the contribution of introducing multimodality. Nevertheless, a very important future direction is to investigate the grounded semantics of language, with multimodal neural networks like our captioning model or contrastive models, using relevant tasks, such as image‐text matching or cross‐modal forced‐choice paradigms (Chrupała, Gelderloos, & Alishahi, 2017; Harwath et al., 2018; Khorrami & Räsänen, 2021; Kádár, Chrupała, & Alishahi, 2015; Nikolaus & Fourtassi, 2021; Vong & Lake, 2022). Moreover, we did not fully incorporate the temporal nature of a child's experience, both in how the videos were converted to still images (impeding the learning of certain kinds of words that might require visuotemporal integration, e.g., “pick” and “take”; Ebert & Pavlick, 2020) and in how networks were trained on the whole corpus simultaneously (one alternative, training networks on age‐ordered data, can be found in Huebner & Willits, 2020).…”