This study investigates the perception of coarticulatory vowel nasality generated using different text-to-speech (TTS) methods in American English. Experiment 1 compared concatenative and neural TTS using a 4IAX task, where listeners discriminated between a word pair containing either both oral or both nasalized vowels and a word pair containing one oral and one nasalized vowel. Vowels occurred either in identical or alternating consonant contexts across pairs to reveal perceptual sensitivity and compensatory behavior, respectively. For identical contexts, listeners were better at discriminating between oral and nasalized vowels in neural than in concatenative TTS for nasalized same-vowel trials, but better discrimination for concatenative TTS was observed for oral same-vowel trials. Meanwhile, listeners displayed less compensation for coarticulation in neural than in concatenative TTS. To determine whether apparent roboticity of the TTS voice shapes vowel discrimination and compensation patterns, a “roboticized” version of neural TTS was generated (with monotonized f0 and an added echo), holding phonetic nasality constant; a ratings study (Experiment 2) confirmed that the manipulation resulted in different apparent roboticity. Experiment 3 compared the discrimination of unmodified neural TTS and roboticized neural TTS: listeners displayed lower accuracy in identical contexts for roboticized relative to unmodified neural TTS, yet performance in alternating contexts was similar.
Apart from other requirements set by the experimenter, bar-pressing situations have at least one characteristic in common: the animal must learn how hard to press the bar if it is to procure reinforcement. (Notterman, 1959, has described the emission of forces over the course of operant level, CRF, and extinction.) This requirement is to be distinguished from additional demands placed upon the organism in the typical experiment involving discrimination of exteroceptive stimuli. In the latter instance, and in addition to the intensive requirements, the animal must learn when to press and when not to press. The eventual disparity in SD–SΔ rate of responding has been used as an indicator of the extent of the discrimination thus established (Frick, 1949; Dinsmoor, 1951; Smith & Hoy, 1954). Since the organism is not reinforced in SD unless it presses hard enough, and since it goes unreinforced in SΔ no matter how hard it presses, experiments concerned with the establishment of a discrimination to exteroceptive stimuli inevitably and concomitantly involve the possibility of interaction between the "when-to-press" and "how-hard-to-press" aspects of bar-pressing behavior. This report is a preliminary attempt to describe the relation between response differentiation and stimulus discrimination during the development of stimulus discrimination.
METHOD
By means of a pair of strain gauges used as a force pick-off, the force-related voltages characteristic of successive bar-pressing responses during the establishment of a discrimination were passed through the amplifiers of an analog computer, and the peak force of each response noted. (Apparatus details are available in Notterman, 1959.) A total of 15 discrimination sessions was given, preceded by two unreinforced basal or operant-level sessions.
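The peak-force measurement described above can be illustrated with a short digital sketch. This is not the authors' apparatus, which used strain gauges and an analog computer; it only shows the idea of noting one peak per response. The function name `peak_forces`, the sampled `trace`, and the rule that a response is a contiguous run of samples at or above a threshold are all assumptions made for illustration. The default threshold of 3.0 grams is the minimum recording force given in the procedure.

```python
def peak_forces(trace, threshold=3.0):
    """Return the peak force (grams) of each response in a sampled force trace.

    Assumption of this sketch: a response is a contiguous run of samples
    at or above `threshold`.
    """
    peaks = []
    current = None  # running peak of the response in progress, if any
    for force in trace:
        if force >= threshold:
            current = force if current is None else max(current, force)
        elif current is not None:
            # Force fell below threshold: the response has ended.
            peaks.append(current)
            current = None
    if current is not None:  # trace ended while a response was in progress
        peaks.append(current)
    return peaks


# Hypothetical force samples (grams), invented for illustration only.
trace = [0.0, 1.2, 3.5, 6.1, 4.0, 0.8, 0.0, 2.9, 0.0, 3.0, 7.4, 9.2, 5.0, 1.0]
print(peak_forces(trace))  # [6.1, 9.2]
```

The press that reaches only 2.9 grams is ignored, since it never meets the recording threshold.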
In order to preclude any temporal conditioning, each of the daily sessions consisted of periods of SD and SΔ having the following irregular sequence: 80 seconds of SD; 80 of SΔ; 40, SD; 20, SΔ; 20, SD; 160, SΔ; 10, SD; 40, SΔ; 160, SD; and 10, SΔ. This sequence was programmed on a continuous tape such that daily sessions could begin at any point in the schedule and run either forward or backward. A daily experimental session consisted of two entire sequences, making for a total time of 20 minutes and 40 seconds. A CRF schedule was maintained during SD, with a required force of 3.0 grams minimum for both reinforcement and recording, and with The P. J. Noyes Company 0.45-milligram pellets as reinforcements.
RESULTS AND DISCUSSION
Figures 1 and 2 represent the data for the first Wistar rat exposed to this procedure; they are typical of data obtained from some 20 additional animals during a related parametric study currently underway as a dissertation by the junior author. The first figure reveals the customary SD–SΔ separation in number of responses emitted during each session as a function of successive discrimination sessions. The SD responding is
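As a check on the session timing given in the procedure, the schedule can be summed directly. The sketch below assumes the ten periods alternate strictly between SD and SΔ, starting with SD; the ASCII labels `S_D` and `S_delta` and all variable names are introduced here for illustration.

```python
# Period durations in seconds, in the order listed in the procedure.
# Assumption: periods alternate strictly between SD and S-delta.
SEQUENCE = [
    ("S_D", 80), ("S_delta", 80),
    ("S_D", 40), ("S_delta", 20),
    ("S_D", 20), ("S_delta", 160),
    ("S_D", 10), ("S_delta", 40),
    ("S_D", 160), ("S_delta", 10),
]

sequence_seconds = sum(seconds for _, seconds in SEQUENCE)
sd_seconds = sum(seconds for label, seconds in SEQUENCE if label == "S_D")
sdelta_seconds = sequence_seconds - sd_seconds

# A daily session runs the whole sequence twice.
session_seconds = 2 * sequence_seconds
minutes, seconds = divmod(session_seconds, 60)

print(sequence_seconds)            # 620
print(sd_seconds, sdelta_seconds)  # 310 310
print(minutes, seconds)            # 20 40
```

Under that assumption, one sequence lasts 620 seconds, split evenly between the two stimulus conditions, and two sequences give the stated 20 minutes and 40 seconds.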
This study explores the production and perception of word-final devoicing in German across text-to-speech (from technology used in common voice-AI “smart” speaker devices, specifically voices from Apple and Amazon) and naturally produced utterances. First, the phonetic realization of word-final devoicing in German across text-to-speech (TTS) and naturally produced words was compared. Acoustic analyses reveal that the presence of cues to a word-final voicing contrast varied across speech types. Naturally produced words with phonologically voiced codas contain partial voicing, as well as longer vowels than words with voiceless codas. However, these distinctions are not present in TTS speech. Next, German listeners completed a forced-choice identification task, in which they heard the words and made coda consonant categorizations, in order to examine the intelligibility consequences of the word-final devoicing patterns across speech types. Intended coda identifications are higher for the naturally produced words than for TTS. Moreover, listeners systematically misidentified voiced codas as voiceless in TTS words. Overall, this study extends previous literature on speech intelligibility at the intersection of speech synthesis and contrast neutralization. TTS voices tend to neutralize salient phonetic cues present in natural speech. Consequently, listeners are less able to identify phonological distinctions in TTS. We also discuss how investigating which cues are more salient in natural speech can benefit synthetic speech generation, making synthetic voices more natural and easier to perceive.
Danish, like closely related Swedish and Norwegian, has descended from Old Norse (Haugen 1976). While the three contemporary languages are variably mutually intelligible, Danish has phonologically diverged from the other Scandinavian languages (Gooskens 2006). This divergence is caused by extensive consonant lenition and vowel reduction within Danish (Basbøll 2005). The lenition of <t> and <d> in syllable coda positions into a sound that Danish linguists have called soft-d is seemingly unique to Danish. In most phonological descriptions, it is transcribed using the phonetic symbol /ð/, a voiced interdental fricative. We assert that this is not accurate; not all phonologists agree that the soft-d is a fricative. Some describe it as an alveolar semi-vowel (Haberland 1994), while others transcribe it as a velarized, retracted, and lowered alveolar approximant (Basbøll 2005). Many observe that the sound resembles lateral /l/, a distinct phoneme of Danish (Wells 2010). Through acoustic analysis of tokens taken from the DanPASS corpus (Grønnum 2016), we show that the acoustic properties (HNR) of soft-d are indeed not those of a fricative, but rather those of an approximant or vowel. Therefore, the use of /ð/ to transcribe this sound is inaccurate and does not align with the goals of the International Phonetic Association.
The decision to include or exclude phonemes in the description of a language is not always straightforward; presentations of the phoneme inventory of Modern Standard German (MSG) often include a discussion of why /ɛ:/ is problematic as a phoneme. This study describes the acoustic realization of /ɛ:/ in comparison to /e:/ in spoken German, specifically South Westphalian. Thirty-nine native German speakers produced /ɛ:/ and /e:/ in hVt non-word frames, and vowel productions were measured for: (1) first and second formants from the steady state of the vowel, (2) duration, and (3) fundamental frequency (f0). Measurements were analyzed with a logistic regression model using the glm function in R. The model showed that while the main effects of F2, duration, and f0 were not significant, F1 was; speakers reliably produced /ɛ:/ lower in the vowel space than /e:/, but not fronter. This preliminary investigation into the acoustic realizations of /ɛ:/ and /e:/, framed by the debate on whether these two sounds are phonetically and phonemically contrastive, is a first step toward understanding them within the larger phonemic inventory of MSG. We hope that this study will reopen a discussion on this topic and help answer the question of whether /ɛ:/ really is a problematic phoneme.
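To make the modelling step concrete, here is a minimal sketch of a logistic regression that predicts vowel category from acoustic measures. It is not the study's analysis, which was run in R and reports significance tests: this version is plain Python, fits coefficients by gradient descent, and computes no p-values. The data are synthetic and invented purely for illustration, with only two standardized predictors (F1 and F2); the function name `fit_logistic` and every numeric setting are assumptions. The synthetic tokens are built so that /ɛ:/ has a higher F1 (a lower vowel) while F2 carries no information.

```python
import math
import random


def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit logistic regression by batch gradient descent.

    Returns [intercept, weight_1, ..., weight_k].
    """
    n, k = len(rows), len(rows[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for x, y in zip(rows, labels):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(label == 1)
            err = p - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


# Synthetic, standardized measurements (NOT data from the study).
# Label 1 stands for /ɛ:/, label 0 for /e:/; columns are [F1, F2].
random.seed(0)
rows, labels = [], []
for _ in range(100):
    rows.append([random.gauss(0.8, 1.0), random.gauss(0.0, 1.0)])
    labels.append(1)
    rows.append([random.gauss(-0.8, 1.0), random.gauss(0.0, 1.0)])
    labels.append(0)

weights = fit_logistic(rows, labels)
print("intercept, F1 weight, F2 weight:", [round(v, 2) for v in weights])
```

On data built this way, the F1 weight comes out clearly positive and larger in magnitude than the F2 weight, mirroring the qualitative pattern the abstract describes. A full analysis would also include duration and f0 and would test each coefficient for significance.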