Even though ambiguous words are common across languages, children have been reported to find it hard to learn homophones, words for which a single label applies to several distinct meanings (e.g., Mazzocco, 1997). The present work addresses this apparent discrepancy between children's learning abilities and the typological pattern of homophony in the lexicon. In a series of five experiments, 20-month-old French toddlers easily learnt a pair of homophones when the two meanings associated with the phonological form belonged to different syntactic categories, or to different semantic categories. However, toddlers failed to learn homophones when the two meanings were distinguished only by grammatical gender. In parallel, we analyzed the lexicons of four languages, Dutch, English, French and German, and observed that homophones are distributed non-arbitrarily, such that easily learnable homophones are more frequent than hard-to-learn ones: pairs of homophones preferentially span syntactic and semantic categories, but not grammatical genders. We show that learning homophones is easier than previously thought, at least when the meanings sharing a phonological form are made sufficiently distinct by their syntactic or semantic context. We propose that this learnability advantage translates into the overall structure of the lexicon: the kinds of homophones present in languages tend to have the properties that make them learnable by toddlers, allowing them to remain in the lexicon.
A basic task in first language acquisition likely involves discovering the boundaries between words or morphemes in input where these basic units are not overtly segmented. A number of unsupervised learning algorithms have been proposed over the last 20 years for this purpose; some have been implemented computationally, but their results remain difficult to compare across papers. We created WordSeg, a tool that is open source, enables reproducible results, and encourages cumulative science in this domain. WordSeg has a modular architecture: It combines a set of corpus-description routines, multiple segmentation algorithms varying in complexity and cognitive assumptions (including several that were not publicly available, or were insufficiently documented), and a rich evaluation package. In this paper, we illustrate the use of the package by analyzing a corpus of child-directed speech in various ways, which further allows us to make recommendations for the experimental design of follow-up work. Supplementary materials allow readers to reproduce every result in this paper, and detailed online instructions further enable them to go beyond what we have done. Moreover, the system can be installed within container software that ensures a stable and reliable environment. Finally, by virtue of its modular architecture and transparency, WordSeg can serve as an open-source platform to which other researchers can add their own segmentation algorithms.
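To make the modular prepare–segment–evaluate workflow concrete, the sketch below shows how the package's Python API can be chained, based on its public documentation (https://wordseg.readthedocs.io). It is a minimal illustration, not the authors' own analysis pipeline: the toy utterances are invented, and exact function signatures and default options should be checked against the installed version of the package.

```python
# Minimal prepare -> segment -> evaluate pipeline with the wordseg package
# (pip install wordseg). The utterances below are illustrative only.
from wordseg.prepare import prepare, gold
from wordseg.algos import tp
from wordseg.evaluate import evaluate

# Phonologized child-directed utterances: phones separated by spaces,
# word boundaries marked with the ';eword' separator (wordseg's input format).
utterances = [
    'dh ax ;eword d aa g iy ;eword',                      # "the doggy"
    'dh ax ;eword l ih t ax l ;eword d aa g iy ;eword',   # "the little doggy"
]

prepared = list(prepare(utterances))   # strip boundary tags -> unsegmented input
gold_text = list(gold(utterances))     # keep word boundaries -> gold standard

# Segment with transitional probabilities, one of several bundled algorithms;
# other algorithms (e.g. puddle, dibs) expose a similar segment() interface.
segmented = list(tp.segment(prepared, threshold='relative'))

# Score the output (boundary, token and type precision/recall/F-score).
scores = evaluate(segmented, gold_text)
for name, value in scores.items():
    print(f'{name}: {value}')
```

Because each stage reads and writes plain text, the same steps can be run from the shell with the corresponding command-line tools (wordseg-prep, wordseg-tp, wordseg-eval), which is how the modular architecture supports swapping in alternative segmentation algorithms.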