Speech synthesis applications have become ubiquitous in navigation systems, digital assistants, and screen or audiobook readers. Despite their impact on the acceptability of the systems in which they are embedded, and despite the fact that different applications probably need different types of text-to-speech (TTS) voices, TTS evaluation is still largely treated as an isolated problem. Even though there is strong agreement among researchers that the mainstream approaches to TTS evaluation are often insufficient and may even be misleading, there exist few clear-cut suggestions as to (1) how TTS evaluations may be realistically improved on a large scale, and (2) how such improvements may lead to informed feedback for system developers and, ultimately, better systems relying on TTS. This paper reviews the current state of the art in TTS evaluation and suggests a novel user-centered research program for this area.
"But the increasing interest in connected, more specifically spontaneous speech data bases has made it mandatory for researchers to enter into phonetics and phonology above the word in real-life communication, and it is in this domain that glottalisation phenomena abound." (Kohler 2001: 317)
The present paper examines glottal stops and the glottalisation of word-initial vowels in Polish and German. The presence of glottal marking is studied as a function of speech style ('speech' vs. 'dialogue'), prominence, phrasal position, speech rate, word type, preceding segment, and following vowel height. A question is also posed about the extent to which glottal marking might depend on the rhythmic structure of a given language or style. We analyzed recordings of 18 Polish and German speakers. The results point to significant differences between the two languages. In German, glottal marking occurs significantly more often (63.4%) than in Polish (45%). Whereas in both languages (and both styles) prominent vowels are glottally marked more often than non-prominent vowels, in German word-initial non-prominent syllables are also marked relatively often. Regarding phrase position, glottal marking occurs significantly more often in phrase-initial than in phrase-medial position in Polish, while no such difference has been found in German. In addition, it is shown that in both languages glottal marking is strongly dependent on the tongue height of the marked vowel: low vowels are more frequently glottalised than non-low vowels. Finally, glottal marking in Polish is more likely to occur when rhythmic variability shifts towards the 'indeterminate', strengthening the hypothesis that glottal marking facilitates perceptual grouping.
Variation in contextual predictability affects phonological and phonetic structure. Reduction and expansion of acoustic-phonetic features are also characteristic of prosodic variability. In this study, we assess the impact of surprisal and prosodic structure on phonetic encoding, both independently of each other and in interaction. We model segmental duration, vowel space size, and spectral characteristics of vowels and consonants as a function of surprisal as well as of syllable prominence, phrase boundary, and speech rate. Correlates of phonetic encoding density are extracted from a subset of the BonnTempo corpus for six languages: American English, Czech, Finnish, French, German, and Polish. Surprisal is estimated from segmental n-gram language models trained on large text corpora. Our findings are generally compatible with a weak version of Aylett and Turk's Smooth Signal Redundancy hypothesis, suggesting that prosodic structure mediates between the requirements of efficient communication and the speech signal. However, this mediation is not perfect, as we found evidence for additional, direct effects of changes in surprisal on the phonetic structure of utterances. These effects appear to be stable across different speech rates.
We report on an analysis of feedback behavior in an Active Listening Corpus as produced verbally, visually (head movement), and bimodally. The behavior is modeled in an embodied conversational agent and displayed, in conversation with a real human, to human participants for perceptual evaluation. Five strategies for the timing of backchannels are compared: copying the timing of the original human listener, producing backchannels at randomly selected times, producing backchannels according to high-level timing distributions relative to the interlocutor's utterances and pauses, or according to local entrainment to the interlocutor's vowels, or according to both. Human observers judge that models with global timing distributions miss fewer opportunities for backchanneling than random timing.