Proceedings of the 24th Conference on Computational Natural Language Learning 2020
DOI: 10.18653/v1/2020.conll-1.15
Acquiring language from speech by learning to remember and predict

Abstract: Classical accounts of child language learning invoke memory limits as a pressure to discover sparse, language-like representations of speech, while more recent proposals stress the importance of prediction for language learning. In this study, we propose a broad-coverage unsupervised neural network model to test memory and prediction as sources of signal by which children might acquire language directly from the perceptual stream. Our model embodies several likely properties of real-time human cognition: it is …

Cited by 7 publications (8 citation statements)
References 103 publications
“…Future work could explore the extent to which syntactic knowledge can be acquired from lower-level (e.g. phonemic or acoustic) input alone by including a word segmentation task (Elsner and Shain, 2017; Shain and Elsner, 2020) for the model. Additionally, recent work in unsupervised grammar induction (Jin and Schuler, 2020; Zhang et al., 2021) has shown that incorporating visual information in the form of images and videos helps learn constituents that denote entities or actions.…”
Section: Discussion
confidence: 99%
“…In terms of input, some models operate on linguistic abstractions of speech, such as phonemic, phonetic or orthographic transcripts (e.g., Frank et al., 2010; Goldwater et al., 2009; Nikolaus and Fourtassi, 2021), phonetic or lexical representations derived using pre-trained automatic speech recognition systems (e.g., Fourtassi and Dupoux, 2014; Roy, 2005; Salvi et al., 2012), or simplified representations of acoustic speech, such as formant frequencies of pre-segmented vowels (Coen, 2006; de Boer and Kuhl, 2003). Another set of models operates directly on real continuous speech (e.g., Kamper et al., 2016; Nixon, 2020; Park and Glass, 2008; Schatz et al., 2021; Shain and Elsner, 2020). Besides models that process language input only, there are models that use concurrent visual input in addition to spoken language (e.g., Alishahi et al., 2017; Chrupała et al., 2017; Coen, 2006; Harwath et al., 2019; Harwath et al., 2016; Khorrami and Räsänen, 2021; Nikolaus and Fourtassi, 2021; Roy, 2005).…”
Section: Previous Work
confidence: 99%
“…Instead of analyzing clustering purity, the ABX test analyzes the phonemic discriminability of internal representations learned by a model. In studies focusing on speech segmentation, such as phone (Michel et al., 2016; Räsänen, 2014; Scharenborg et al., 2007), syllable (Räsänen et al., 2018), or word segmentation (Shain and Elsner, 2020), the model is typically a system with a mechanism to identify the temporal positions of unit boundaries. These boundaries are then compared to unit boundaries in ground-truth phonetic or word-level annotations.…”
Section: Reference Point? Pros Cons
confidence: 99%
“…Methods for automatically learning phone- or word-like units from unlabelled speech audio could enable speech technology in severely low-resourced settings [1,2] and could lead to new cognitive models of human language acquisition [3][4][5]. The goal in unsupervised representation learning of phone units is to learn features which capture phonetic contrasts while being invariant to properties like the speaker or channel.…”
Section: Introduction
confidence: 99%
“…We evaluate these on four different tasks: unsupervised phone segmentation [23], ABX phone discrimination [24], same-different word discrimination [25], and as inputs to a symbolic word segmentation algorithm [1]. The last-mentioned is particularly important, since the segmentation and clustering of word-like units remains a major open challenge [5,26].…”
Section: Introduction
confidence: 99%
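The ABX phone discrimination test mentioned in these excerpts asks whether a learned representation places a token X closer to a same-category token A than to a different-category token B. Below is a minimal, hypothetical sketch of that decision rule; the function names are illustrative, and real evaluations typically use DTW-aligned distances over frame sequences rather than the plain Euclidean distance used here for simplicity.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def abx_trial(a, b, x, dist=euclidean):
    """A and X share a phone category, B does not: the trial passes
    when X lies closer to A than to B under the chosen distance."""
    return dist(a, x) < dist(b, x)

def abx_accuracy(trials, dist=euclidean):
    """Fraction of (a, b, x) trials where the same-category pair wins;
    0.5 corresponds to chance, 1.0 to perfect discriminability."""
    return sum(abx_trial(a, b, x, dist) for a, b, x in trials) / len(trials)
```

Because the test needs only a distance over representations, it can compare internal features from very different models without requiring them to output explicit phone labels, which is why the excerpt contrasts it with clustering-purity analyses.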