Sound units play a pivotal role in cognitive models of auditory comprehension. The general consensus is that during perception listeners break down speech into auditory words and subsequently phones. Indeed, cognitive speech recognition is typically taken to be computationally intractable without phones. Here we present a computational model trained on 20 hours of conversational speech that recognizes word meanings within the range of human performance (model 25%, native speakers 20–44%), without making use of phone or word form representations. Our model also generates successfully predictions about the speed and accuracy of human auditory comprehension. At the heart of the model is a ‘wide’ yet sparse two-layer artificial neural network with some hundred thousand input units representing summaries of changes in acoustic frequency bands, and proxies for lexical meanings as output units. We believe that our model holds promise for resolving longstanding theoretical problems surrounding the notion of the phone in linguistic theory.
Recent research on the acoustic realization of affixes has revealed differences between phonologically homophonous affixes, e.g. the different kinds of final [s] and [z] in English (Plag, Homann & Kunter 2017, Zimmermann 2016a). Such results are unexpected and unaccounted for in widely accepted post-Bloomfieldian item-and-arrangement models (Hockett 1954), which separate lexical and post-lexical phonology, and in models which interpret phonetic effects as consequences of different prosodic structure. This paper demonstrates that the differences in duration of English final S as a function of the morphological function it expresses (non-morphemic, plural, third person singular, genitive, genitive plural, cliticizedhas, and cliticizedis) can be approximated by considering the support for these morphological functions from the words’ sublexical and collocational properties. We estimated this support using naïve discriminative learning and replicated previous results for English vowels (Tucker, Sims & Baayen 2019), indicating that segment duration is lengthened under higher functional certainty but shortened under functional uncertainty. We discuss the implications of these results, obtained with a wide learning network that eschews representations for morphemes and exponents, for models in theoretical morphology as well as for models of lexical processing.
AbstractMany studies report shorter acoustic durations, more coarticulation and reduced articulatory targets for frequent words. This study investigates a factor ignored in discussions on the relation between frequency and phonetic detail, namely, that motor skills improve with experience. Since frequency is a measure of experience, it follows that frequent words should show increased articulatory proficiency. We used EMA to test this prediction on German inflected verbs with [a] as stem vowels. Modeling median vertical tongue positions with quantile regression, we observed significant modulation by frequency of the U-shaped trajectory characterizing the articulation of the [a:]. These modulations reflect two constraints, one favoring smooth trajectories through anticipatory coarticulation, and one favoring clear articulation by realizing lower minima. The predominant pattern across sensors, exponents, and speech rate suggests that the constraint of clarity dominates for lower-frequency words. For medium-frequency words, the smoothness constraint leads to a raising of the trajectory. For the higher-frequency words, both constraints are met simultaneously, resulting in low minima and stronger coarticulation. These consequences of motor practice for articulation challenge both the common view that a higher-frequency of use comes with more articulatory reduction, and cognitive models of speech production positing that articulation is post-lexical.
The present study introduces articulography, the measurement of the position of tongue and lips during speech, as a promising method to the study of dialect variation. By using generalized additive modeling to analyze articulatory trajectories, we are able to reliably detect aggregate group differences, while simultaneously taking into account the individual variation across dozens of speakers. Our results on the basis of Dutch dialect data show clear differences between the southern and the northern dialect with respect to tongue position, with a more frontal tongue position in the dialect from Ubbergen (in the southern half of the Netherlands) than in the dialect of Ter Apel (in the northern half of the Netherlands). Thus articulography appears to be a suitable tool to investigate structural differences in pronunciation at the dialect level.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.