Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? - A computational investigation

Khorrami, Khazar; Räsänen, Okko

doi:10.34842/w3vw-s845

Cited by 5 publications

(5 citation statements)

References 0 publications

“…In other words, inputs that are similar in the MFCC domain are also likely to be similar in the latent encoding space, whereas inputs that are distant in the input space are also likely to be distant in the latent space before any learning has taken place. Since MFCCs already carry phonemic information while ignoring some non-phonemic variability (e.g., F0) due to their design, the corresponding latent encodings are also likely be discriminative with respect to vowel categories (see also, e.g., Chrupała et al, 2020; Khorrami and Räsänen, 2021). In this context, it appears that much of the vowel discrimination is already explainable in terms of the input features, but the discriminatory characteristics of the latents then change somewhat as the models learn from input data: CPC improves on both native and non-native contrasts, while APC is relatively stable on the native contrasts and degrades on the non-native ones.…”

Section: Discussion For Experiments #2 (mentioning)

confidence: 99%

“…Another set of models operate directly on real continuous speech (e.g., Kamper et al, 2016; Nixon, 2020; Park and Glass, 2008; Schatz et al, 2021; Shain and Elsner, 2020). Besides processing language input only, there are models that use visual concurrent input in addition to spoken language (e.g., Alishahi et al, 2017; Chrupała et al, 2017; Coen, 2006; Harwath et al, 2019; Harwath et al, 2016; Khorrami and Räsänen, 2021; Nikolaus and Fourtassi, 2021; Roy, 2005). Besides passive perception approaches, there are also models that can interact with simulated or real human caregivers (e.g., Howard and Messum, 2011; Rasilo and Räsänen, 2017) and studies using multiple computational agents that can interact with each other using some communicative means (e.g., Kirby, 2001; Moulin-Frier et al, 2015; Oudeyer, 2005; see also Oudeyer et al, 2019, for a recent review).…”

Section: Previous Work (mentioning)

confidence: 99%

“…Since these models demonstrate emerging understanding of the semantics between auditory speech and the visual world without ever being explicitly taught about the structure of either modality, researchers have become interested on whether the internal representations of these models also show signs of emergent linguistic organization. Studied VGS model capabilities have included, e.g., phonetic, syllabic, and lexical representations and their temporal segmentation, lexical semantics, and, e.g., temporal competition in word activations (Alishahi et al, 2017; Chrupała et al, 2017; Harwath and Glass, 2019; Havard et al, 2019; Khorrami and Räsänen, 2021; Merkx et al, 2019; see also Chrupała, 2021, for a recent review). At the time of writing, the so-far distinct activities in Zerospeech and VGS evaluation have become integrated in a new multimodal extension of the ZeroSpeech challenge (Alishahi et al, 2021).…”

Section: Model Evaluation On Multiple Criteria (mentioning)

confidence: 99%

Introducing meta-analysis in the evaluation of computational models of infant language development

Blandón, Cristià, Räsänen

2021

Preprint

Self Cite

Computational models of child language development can help us understand the cognitive underpinnings of the language learning process. One advantage of computational modeling is that it has the potential to address multiple aspects of language learning within a single learning architecture. If successful, such integrated models would help to pave the way for a more comprehensive and mechanistic understanding of language development. However, in order to develop more accurate, holistic, and hence impactful models of infant language learning, the research on models also requires model evaluation practices that allow comparison of model behavior to empirical data from infants across a range of language capabilities. Moreover, there is a need for practices that can compare developmental trajectories of infants to those of models as a function of language experience. The present study aims to take the first steps to address these needs. More specifically, we will introduce the concept of comparing models with large-scale cumulative empirical data from infants, as quantified by meta-analyses conducted across a large number of individual behavioral studies. We start by formalizing the connection between measurable model and human behavior, and then present a basic conceptual framework for meta-analytic evaluation of computational models together with basic guidelines intended as a starting point for later work in this direction. We exemplify the meta-analytic model evaluation approach with two modeling experiments on infant-directed speech preference and native/non-native vowel discrimination. We also discuss the advantages, challenges, and potential future directions of meta-analytic evaluation practices.

“…The opposite trend, viz. researchers interested in perceptual information in human development using neural networks as models, also exist: see for instance the work of Khorrami and Räsänen (2021); Nikolaus and Fourtassi (2021). A related trend in computational semantics relates specific aspects of meaning to situated information (Ebert et al, 2022; Ghaffari and Krishnaswamy, 2023, e.g.).…”

Section: Related Work (mentioning)

confidence: 99%

Grounded and well-rounded: a methodological approach to the study of cross-modal and cross-lingual grounding

Mickus, Zosa, Paperno

2023

Findings of the Association for Computational Linguistics: EMNLP 2023

Grounding has been argued to be a crucial component towards the development of more complete and truly semantically competent artificial intelligence systems. Literature has divided into two camps: While some argue that grounding allows for qualitatively different generalizations, others believe it can be compensated by mono-modal data quantity. Limited empirical evidence has emerged for or against either position, which we argue is due to the methodological challenges that come with studying grounding and its effects on NLP systems. In this paper, we establish a methodological framework for studying what the effects are, if any, of providing models with richer input sources than text-only. The crux of it lies in the construction of comparable samples of populations of models trained on different input modalities, so that we can tease apart the qualitative effects of different input sources from quantifiable model performances. Experiments using this framework reveal qualitative differences in model behavior between crossmodally grounded, cross-lingually grounded, and ungrounded models, which we measure both at a global dataset level as well as for specific word representations, depending on how concrete their semantics is.

“…As commonly applied in other multimodal XSL work (Chrupała et al, 2015; Khorrami and Räsänen, 2021). 6 While Vinyals et al (2015) fed the image features only at the first timestep into the LSTM, here we feed it at every timestep as this showed to improve performance on our evaluation substantially. An explanation could be that when feeding the image features only at the first timestep the model gradually forgets about the input, and relies more on the language modeling task of next-word prediction, which does not aid the learning of visually-grounded semantics.…”

mentioning

confidence: 99%

Modeling the Interaction Between Perception-Based and Production-Based Learning in Children’s Early Acquisition of Semantic Knowledge

Nikolaus, Fourtassi

2021

Proceedings of the 25th Conference on Computational Natural Language Learning

Children learn the meaning of words and sentences in their native language at an impressive speed and from highly ambiguous input. To account for this learning, previous computational modeling has focused mainly on the study of perception-based mechanisms like cross-situational learning. However, children do not learn only by exposure to the input. As soon as they start to talk, they practice their knowledge in social interactions and they receive feedback from their caregivers. In this work, we propose a model integrating both perception- and production-based learning using artificial neural networks which we train on a large corpus of crowd-sourced images with corresponding descriptions. We found that production-based learning improves performance above and beyond perception-based learning across a wide range of semantic tasks including both word- and sentence-level semantics. In addition, we documented a synergy between these two mechanisms, where their alternation allows the model to converge on more balanced semantic knowledge. The broader impact of this work is to highlight the importance of modeling language learning in the context of social interactions where children are not only understood as passively absorbing the input, but also as actively participating in the construction of their linguistic knowledge.
