2022
DOI: 10.1111/cogs.13122

Cross‐Situational Word Learning With Multimodal Neural Networks

Abstract: In order to learn the mappings from words to referents, children must integrate co-occurrence information across individually ambiguous pairs of scenes and utterances, a challenge known as cross-situational word learning. In machine learning, recent multimodal neural networks have been shown to learn meaningful visual-linguistic mappings from cross-situational data, as needed to solve problems such as image captioning and visual question answering. These networks are potentially appealing as cognitive models be…
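To make the learning problem concrete, here is a minimal PyTorch sketch of one way a contrastive multimodal network can acquire word-referent mappings from co-occurrence alone. It is an illustration under simplifying assumptions, not the architecture evaluated in the paper; the class and parameter names (CrossSituationalMatcher, image_dim, embed_dim) are hypothetical. The model embeds an utterance as the mean of its word embeddings, projects precomputed image features into the same space, and is trained so that co-occurring utterance-scene pairs score higher than mismatched pairs.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossSituationalMatcher(nn.Module):
    def __init__(self, vocab_size, image_dim, embed_dim=128):
        super().__init__()
        # Word embeddings; padding id 0 maps to the zero vector.
        self.word_embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Projects precomputed image features into the shared embedding space.
        self.image_proj = nn.Linear(image_dim, embed_dim)
        # Learnable temperature for the similarity logits.
        self.logit_scale = nn.Parameter(torch.tensor(2.0))

    def forward(self, utterances, image_feats):
        # utterances: (batch, max_words) padded word ids
        # image_feats: (batch, image_dim) precomputed visual features
        # Utterance embedding = mean of its word embeddings (pads embed to zero;
        # a fuller model would mask them out of the mean).
        text = F.normalize(self.word_embed(utterances).mean(dim=1), dim=-1)
        image = F.normalize(self.image_proj(image_feats), dim=-1)
        return self.logit_scale.exp() * text @ image.t()  # (batch, batch) similarities

def contrastive_loss(logits):
    # Co-occurring utterance-scene pairs lie on the diagonal of the similarity
    # matrix; the symmetric cross-entropy pushes them above mismatched pairs.
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2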


Cited by 9 publications (8 citation statements)
References 61 publications (128 reference statements)
“…CVCL's successes are implicitly supported, in part, by the child's actions, attention, and social engagement, although other benefits of active learning are beyond the model's reach (62). Third, children learn continually from an ongoing stream of experience, whereas CVCL learns by revisiting its training data repeatedly over multiple epochs, although continual contrastive learning has been successful too (63). Finally, young children must learn from speech whereas CVCL learns from transcribed utterances, trading useful speech cues like intonation and emphasis for explicit word boundaries (30).…”
Section: Discussion (mentioning)
confidence: 99%
“…We mainly focused on linguistic analyses that are applicable to text‐only setups, because this enables us to better isolate the contribution of introducing multimodality. Nevertheless, a very important future direction is to investigate grounded semantics of the language, with multimodal neural networks like our captioning model or contrastive models, using relevant tasks, such as image‐text matching or cross‐modal forced‐choice paradigms (Chrupała, Gelderloos, & Alishahi, 2017; Harwath et al., 2018; Khorrami & Räsänen, 2021; Kádár, Chrupała, & Alishahi, 2015; Lazaridou, Chrupała, Fernández, & Baroni, 2016; Nikolaus & Fourtassi, 2021; Vong & Lake, 2022). Moreover, we did not fully incorporate the temporal nature of a child's experience, both in how the videos were converted to still images (impeding learning of certain kinds of words that might require visuotemporal integration, e.g., “pick” and “take”; Ebert & Pavlick, 2020) and how networks were trained on the whole corpus simultaneously (one alternative, training networks on age‐ordered data, can be found in Huebner & Willits, 2020).…”
Section: Discussion (mentioning)
confidence: 99%
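To illustrate the cross-modal forced-choice paradigm this excerpt mentions, the sketch below evaluates a matcher of the kind sketched under the abstract: each trial pairs one correct referent image with several foils, and the model is credited when the correct image receives the highest similarity (chance is 1/(1 + n_foils)). The function name and tensor layout are assumptions for illustration, not code from any of the cited papers.

import torch
import torch.nn.functional as F

@torch.no_grad()
def forced_choice_accuracy(model, word_ids, target_feats, foil_feats):
    # word_ids:     (n_trials, 1) single-word "utterances"
    # target_feats: (n_trials, image_dim) features of the correct referent
    # foil_feats:   (n_trials, n_foils, image_dim) features of distractor images
    candidates = torch.cat([target_feats.unsqueeze(1), foil_feats], dim=1)
    text = F.normalize(model.word_embed(word_ids).mean(dim=1), dim=-1)
    image = F.normalize(model.image_proj(candidates), dim=-1)
    sims = torch.einsum("te,tce->tc", text, image)  # similarity per candidate
    # The correct referent was placed at index 0 of each trial.
    return (sims.argmax(dim=1) == 0).float().mean().item()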
“…Another advantage of SAYCam‐S is its multimodality: it contains parallel vision and language inputs. Adding visual information provides grounding for words, potentially allowing the networks to learn references from words to objects, or at least visual features in the input (Hill et al., 2021; Vong & Lake, 2022). Multimodal learning has been shown to help resolve ambiguities when only linguistic information is present (Berzak, Barbu, Harari, Katz, & Ullman, 2015; Christie et al., 2016), induce constituent structures (Shi, Mao, Gimpel, & Livescu, 2019), and ground events described in language to video (Siddharth, Barbu, & Siskind, 2014; Yu, Siddharth, Barbu, & Siskind, 2015).…”
Section: Neural Network and Training (mentioning)
confidence: 99%
“…Multimodal models trained on visual question answering or reference games can use cross-situational learning to learn grounded meanings of words (Mao et al., 2019; Nikolaus and Fourtassi, 2021; Portelance et al., 2023). Nonetheless, computational models show different learning biases than humans in many cases, at least in the absence of specific training or architectural interventions (Gauthier et al., 2018; Vong and Lake, 2022). Ultimately, however, all of these studies are limited in their cognitive plausibility and language learning by a reliance on supervised training on small, artificial datasets in which texts and images correspond to arrangements of a limited set of objects in a simple, usually static scene.…”
Section: Cognitively Oriented Approaches (mentioning)
confidence: 99%