2021
DOI: 10.34842/w3vw-s845

Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? - A computational investigation


Cited by 5 publications (5 citation statements)
References 0 publications
“…In other words, inputs that are similar in the MFCC domain are also likely to be similar in the latent encoding space, whereas inputs that are distant in the input space are also likely to be distant in the latent space before any learning has taken place. Since MFCCs already carry phonemic information while ignoring some non-phonemic variability (e.g., F0) due to their design, the corresponding latent encodings are also likely to be discriminative with respect to vowel categories (see also, e.g., Chrupała et al., 2020; Khorrami and Räsänen, 2021). In this context, it appears that much of the vowel discrimination is already explainable in terms of the input features, but the discriminatory characteristics of the latents then change somewhat as the models learn from input data: CPC improves on both native and non-native contrasts, while APC is relatively stable on the native contrasts and degrades on the non-native ones.…”
Section: Discussion For Experiments #2 (mentioning)
confidence: 99%
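
To make the reasoning in the excerpt above concrete, the following minimal Python sketch (not taken from the cited works; all data and names are hypothetical) checks whether pairwise distances between MFCC inputs are already preserved by an untrained encoder's latent space, using a rank correlation between the two sets of distances:

import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Toy stand-in data: 200 frames of 13-dimensional MFCC features.
mfcc = rng.normal(size=(200, 13))

# A random, untrained linear "encoder" projecting MFCCs into a 32-d latent space.
W = rng.normal(size=(13, 32))
latents = mfcc @ W

# Pairwise distances in the input space and in the latent space,
# then the rank correlation between the two distance profiles.
d_input = pdist(mfcc, metric="euclidean")
d_latent = pdist(latents, metric="euclidean")
rho, _ = spearmanr(d_input, d_latent)
print(f"Spearman correlation between input and latent distances: {rho:.2f}")

A high correlation before any training would indicate, as the excerpt argues, that much of the apparent vowel discriminability in the latents can be inherited directly from the input features.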
“…Another set of models operates directly on real continuous speech (e.g., Kamper et al., 2016; Nixon, 2020; Park and Glass, 2008; Schatz et al., 2021; Shain and Elsner, 2020). Besides processing language input only, there are models that use concurrent visual input in addition to spoken language (e.g., Alishahi et al., 2017; Chrupała et al., 2017; Coen, 2006; Harwath et al., 2019; Harwath et al., 2016; Khorrami and Räsänen, 2021; Nikolaus and Fourtassi, 2021; Roy, 2005). Besides passive perception approaches, there are also models that can interact with simulated or real human caregivers (e.g., Howard and Messum, 2011; Rasilo and Räsänen, 2017) and studies using multiple computational agents that can interact with each other using some communicative means (e.g., Kirby, 2001; Moulin-Frier et al., 2015; Oudeyer, 2005; see also Oudeyer et al., 2019, for a recent review).…”
Section: Previous Work (mentioning)
confidence: 99%
“…The opposite trend, viz. researchers interested in perceptual information in human development using neural networks as models, also exists: see, for instance, the work of Khorrami and Räsänen (2021) and Nikolaus and Fourtassi (2021). A related trend in computational semantics relates specific aspects of meaning to situated information (e.g., Ebert et al., 2022; Ghaffari and Krishnaswamy, 2023).…”
Section: Related Work (mentioning)
confidence: 99%
“…As commonly applied in other multimodal XSL work (Chrupała et al., 2015; Khorrami and Räsänen, 2021). While Vinyals et al. (2015) fed the image features into the LSTM only at the first timestep, here we feed them at every timestep, as this was shown to substantially improve performance on our evaluation. An explanation could be that when the image features are fed only at the first timestep, the model gradually forgets about the input and relies more on the language-modeling task of next-word prediction, which does not aid the learning of visually grounded semantics.…”
(mentioning)
confidence: 99%
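
As an illustration of the architectural choice described in the excerpt above, the sketch below (a minimal PyTorch example, not the cited authors' code; all class names and dimensions are hypothetical) conditions an LSTM decoder on the image features at every timestep by concatenating them with each word embedding, rather than feeding them only at the first step:

import torch
import torch.nn as nn

class EveryStepDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, image_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Input at each step = word embedding concatenated with the image features.
        self.lstm = nn.LSTM(embed_dim + image_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, tokens):
        # image_feats: (batch, image_dim); tokens: (batch, seq_len)
        emb = self.embed(tokens)
        img = image_feats.unsqueeze(1).expand(-1, emb.size(1), -1)
        h, _ = self.lstm(torch.cat([emb, img], dim=-1))  # image visible at every step
        return self.out(h)  # next-word logits, shape (batch, seq_len, vocab_size)

decoder = EveryStepDecoder(vocab_size=1000)
logits = decoder(torch.randn(4, 256), torch.randint(0, 1000, (4, 12)))
print(logits.shape)  # torch.Size([4, 12, 1000])

Feeding the image at every step keeps the visual conditioning signal present throughout decoding, which is one way to counteract the forgetting effect described in the excerpt.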