A Computational Model of Early Language Acquisition from Audiovisual Experiences of Young Infants

Räsänen, Okko; Khorrami, Khazar

doi:10.21437/interspeech.2019-1523

Cited by 9 publications

(9 citation statements)

References 36 publications

(51 reference statements)

“…In addition, when implemented as a neural network with several hidden layers, these hidden layers start to reflect selectivity towards different types of linguistic units that the input speech consists of. This is in line with earlier findings using neural network models using supervised training (Nagamine et al, 2015; see also Magnuson et al, 2020) or simplified visual input (Räsänen and Khorrami, 2019). Here we show that similar emergence of units can be observed in learning conditions analogous to cross-situational learning.…”

Section: Discussion (supporting)

confidence: 93%

“…Recently, Räsänen and Khorrami (2019) trained a weakly supervised convolutional neural network (CNN) VGS model to map acoustic speech to the labels of concurrently visible objects attended by the baby hearing the speech, as extracted from head-mounted video data from real infant-caregiver interactions of English-learning infants (Bergelson & Aslin, 2017). They then measured the so-called phoneme selectivity index (PSI) (Mesgarani et al, 2014) of the network nodes and layers.…”

Section: Earlier Related Work (mentioning)

confidence: 99%

“…In the future work, it would be important to test the audiovisual models with real infant language and visual input. Some baby steps to this direction already exist (Räsänen & Khorrami, 2019), but systematic investigation at the scale of real infant language experiences would be ideal to understand the role of visual experience in early organization of language. Ideally, datasets from several different languages would be also utilized and compared.…”

Section: Limitations Of the Present Study (mentioning)

confidence: 99%

“…In contrast to viewing language learning as a composition of different learning tasks, an alternative picture of the process can also be painted: what if processes such as word segmentation or phonetic category acquisition are not necessary stepping stones for speech comprehension, but that language learning could be bootstrapped by meaning-driven predictive learning, where the learner attempts to connect the (initially unsegmented) auditory stream to the objects and events in the observable surroundings (Johnson et al, 2010; Räsänen and Rasilo, 2015; also referred to as discriminative learning in Baayen et al, 2015; see also Ramscar and Port, 2016). While tackling this idea has been challenging in empirical terms, a number of computational studies have explored this idea along the years (e.g., but not limited to, Yu et al, 2005; Roy and Pentland, 2002; Räsänen and Rasilo, 2015; Chrupała et al, 2017; Alishahi et al, 2017; Räsänen and Khorrami, 2019; ten Bosch et al, 2008; Ballard and Yu, 2004). These models have demonstrated successful learning of speech comprehension skills in terms of connecting words in continuous speech to their visual referents with minimal or fully absent prior linguistic knowledge.…”

Section: Introduction (mentioning)

confidence: 99%

Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? - A computational investigation

Khorrami, Räsänen (2021). Preprint. Self-citation.

Decades of research have studied how language-learning infants learn to discriminate speech sounds, segment words, and associate words with their meanings. While gradual development of such capabilities is unquestionable, the exact nature of these skills and the underlying mental representations remains unclear. In parallel, computational studies have shown that basic comprehension of speech can be achieved by statistical learning between speech and concurrent referentially ambiguous visual input. These models can operate without prior linguistic knowledge such as representations of linguistic units, and without learning mechanisms specifically targeted at such units. This has raised the question whether knowledge of linguistic units, such as phone(me)s, syllables, and words, could actually emerge as latent representations supporting the translation between speech and representations in other modalities, instead of the units ever being proximal learning goals for the learner. In this study, we formulate this idea as the so-called latent language hypothesis (LLH), connecting linguistic representation learning to general predictive processing within and across sensory modalities. We review the extent to which the audiovisual aspect of LLH is supported by the existing computational studies. We then explore LLH further in extensive learning simulations with different neural network models for audiovisual cross-situational learning, comparing learning from both synthetic and real speech data. We investigate whether the latent representations learned by the networks reflect phonetic, syllabic, or lexical structure of input speech by utilizing an array of complementary evaluation metrics related to linguistic selectivity and temporal characteristics of the representations. As a result, we find that representations associated with phonetic, syllabic, and lexical units of speech indeed emerge from the audiovisual learning process. The finding is also robust against variations in model architecture or characteristics of model training and testing data. The results suggest that cross-modal and cross-situational learning may, in principle, assist in early language development much beyond just enabling association of acoustic word forms to their referential meanings.

“…In computational studies, researchers have built models that implement in-principle learning algorithms, and created training sets to test the abilities of the models to find statistical regularities in the input data. Some work in modeling word learning has used sensory data collected from adult learners or robots (Roy & Pentland, 2002; Yu & Ballard, 2007; Rasanen & Khorrami, 2019), while many models take symbolic data or simplified inputs (Frank et al, 2009; Kachergis & Yu, 2017; K. Smith, Smith, & Blythe, 2011; Fazly, Alishahi, & Stevenson, 2010; Yu & Ballard, 2007).…”

Section: Introduction (mentioning)

confidence: 99%

A Computational Model of Early Word Learning from the Infant's Point of View

Tsutsui, Chandrasekaran, Reza, et al. (2020). Preprint.

Human infants have the remarkable ability to learn the associations between object names and visual objects from inherently ambiguous experiences. Researchers in cognitive science and developmental psychology have built formal models that implement in-principle learning algorithms, and then used preselected and pre-cleaned datasets to test the abilities of the models to find statistical regularities in the input data. In contrast to previous modeling approaches, the present study used egocentric video and gaze data collected from infant learners during natural toy play with their parents. This allowed us to capture the learning environment from the perspective of the learner's own point of view. We then used a Convolutional Neural Network (CNN) model to process sensory data from the infant's point of view and learn name-object associations from scratch. As the first model that takes raw egocentric video to simulate infant word learning, the present study provides a proof of principle that the problem of early word learning can be solved, using actual visual data perceived by infant learners. Moreover, we conducted simulation experiments to systematically determine how visual, perceptual, and attentional properties of infants' sensory experiences may affect word learning.

Modelling Human Word Learning and Recognition Using Visually Grounded Speech

et al. 2022

Many computational models of speech recognition assume that the set of target words is already given. This implies that these models learn to recognise speech in a biologically unrealistic manner, i.e. with prior lexical knowledge and explicit supervision. In contrast, visually grounded speech models learn to recognise speech without prior lexical knowledge by exploiting statistical dependencies between spoken and visual input. While it has previously been shown that visually grounded speech models learn to recognise the presence of words in the input, we explicitly investigate such a model as a model of human speech recognition. We investigate the time course of noun and verb recognition as simulated by the model using a gating paradigm to test whether its recognition is affected by well-known word competition effects in human speech processing. We furthermore investigate whether vector quantisation, a technique for discrete representation learning, aids the model in the discovery and recognition of words. Our experiments show that the model is able to recognise nouns in isolation and even learns to properly differentiate between plural and singular nouns. We also find that recognition is influenced by word competition from the word-initial cohort and neighbourhood density, mirroring word competition effects in human speech comprehension. Lastly, we find no evidence that vector quantisation is helpful in discovering and recognising words, though our gating experiment does show that the LSTM-VQ model is able to recognise the target words earlier.
