Using generative modelling to produce varied intonation for speech synthesis

Hodari, Zack; Watts, Oliver; King, Simon

doi:10.21437/ssw.2019-43

Cited by 20 publications

(17 citation statements)

References 36 publications


“…For example, if the end-to-end system has no overt model of stress, how might we control emphasis? Recent work [Hodari et al 2019, Skerry-Ryan et al 2018] explores this issue and proposes various solutions. However, from a practical perspective, when using an end-to-end speech synthesis system for a SIA it is important to be aware of what expressive speech control is available, if any, as well as the ability to correct errors in the synthesis produced.…”

Section: Designing Expressive Speech - Cross Speaker Features 1351 Language (mentioning)

confidence: 99%

Building and Designing Expressive Speech Synthesis

Aylett, Clark, Cowan et al. 2021

The Handbook on Socially Interactive Agents


"You all know the test for artificial intelligence - the Turing test. A human judge has a conversation with a human and a computer. If the judge can't tell the machine apart from the human, the machine has passed the test. I now propose a test for computer voices - the Ebert test. If a computer voice can successfully tell a joke and do the timing and delivery as well as Henny Youngman, then that's the voice I want." - Roger Ebert.


“…Human speech varies due to contextual as well as other more arbitrary factors such as prosody dynamics. There have been studies suggesting to add this idiosyncratic or dynamic variation to a "flat" artificial voice (Hodari et al, 2019). However, it is still unclear how much variability is needed and what is an optimal combination of intonational features (Velner et al, 2020).…”

Section: Qualitative Analyses (mentioning)

confidence: 99%

The Human Takes It All: Humanlike Synthesized Voices Are Perceived as Less Eerie and More Likable. Evidence From a Subjective Ratings Study

2020


Background: The increasing involvement of social robots in human lives raises the question as to how humans perceive social robots. Little is known about human perception of synthesized voices. Aim: To investigate which synthesized voice parameters predict the speaker's eeriness and voice likability; to determine if individual listener characteristics (e.g., personality, attitude toward robots, age) influence synthesized voice evaluations; and to explore which paralinguistic features subjectively distinguish humans from robots/artificial agents. Methods: 95 adults (62 females) listened to randomly presented audio-clips of three categories: synthesized (Watson, IBM), humanoid (robot Sophia, Hanson Robotics), and human voices (five clips/category). Voices were rated on intelligibility, prosody, trustworthiness, confidence, enthusiasm, pleasantness, human-likeness, likability, and naturalness. Speakers were rated on appeal, credibility, human-likeness, and eeriness. Participants' personality traits, attitudes to robots, and demographics were obtained. Results: The human voice and human speaker characteristics received reliably higher scores on all dimensions except for eeriness. Synthesized voice ratings were positively related to participants' agreeableness and neuroticism. Females rated synthesized voices more positively on most dimensions. Surprisingly, interest in social robots and attitudes toward robots played almost no role in voice evaluation. Contrary to the expectations of an uncanny valley, when the ratings of human-likeness for both the voice and the speaker characteristics were higher, they seemed less eerie to the participants. Moreover, when the speaker's voice was more humanlike, it was more liked by the participants. This latter point was only applicable to one of the synthesized voices. Finally, pleasantness and trustworthiness of the synthesized voice predicted the likability of the speaker's voice. Qualitative content analysis identified intonation, sound, emotion, and imageability/embodiment as diagnostic features. Discussion: Humans clearly prefer human voices, but manipulating diagnostic speech features might increase acceptance of synthesized voices and thereby support human-robot interaction. There is limited evidence that human-likeness of a voice is negatively linked to the perceived eeriness of the speaker.


“…While the naturalness of state-of-the-art TTS is near identical to human speech [1], the prosody may not always be realistic [2]. Overall, prosody in TTS might be described as boring or flat [3], and can become fatiguing for listeners [4].…”

Section: Introduction (mentioning)

confidence: 99%

CAMP: A Two-Stage Approach to Modelling Prosody in Context

Hodari, Moinet, Karlapati et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate compared with other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In this paper, we propose solutions to both these issues. To mitigate the challenge of modelling a slow-varying signal, we learn to disentangle prosodic information using a word level representation. To alleviate the ill-posed nature of prosody modelling, we use syntactic and semantic information derived from text to learn a context-dependent prior over our prosodic space. Our context-aware model of prosody (CAMP) outperforms the state-of-the-art technique, closing the gap with natural speech by 26%. We also find that replacing attention with a jointly-trained duration model improves prosody significantly.

