Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis

Suni, Antti; Kakouros, Sofoklis; Vainio, Martti; Simko, Juraj

doi:10.21437/speechprosody.2020-192

Cited by 10 publications

(4 citation statements)

References 18 publications

“…Ye et al (2023) show that the prosody pattern reflecting relevant text context contributes to the enhanced TTS. Suni et al (2020); Nguyen et al (2020) suggest that prosodic boundary plays an important role in the naturalness and intelligibility of speech. To incorporate an inherent prosodic structure of input utterance within its context, we build Prosodic Boundary Detector (PBD) trained on a large corpus along with prominence labels.…”

Section: Prosody Phrasing (mentioning)

confidence: 99%

DPP-TTS: Diversifying prosodic features of speech via determinantal point processes

Joo, Koh, Jung, 2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

With the rapid advancement in deep generative models, recent neural Text-To-Speech (TTS) models have succeeded in synthesizing humanlike speech. There have been some efforts to generate speech with various prosody beyond monotonous prosody patterns. However, previous works have several limitations. First, typical TTS models depend on the scaled sampling temperature for boosting the diversity of prosody. Speech samples generated at high sampling temperatures often lack perceptual prosodic diversity, thereby hampering the naturalness of the speech. Second, the diversity among samples is neglected since the sampling procedure often focuses on a single speech sample rather than multiple ones. In this paper, we propose DPP-TTS: a text-to-speech model based on Determinantal Point Processes (DPPs) with a new objective function and prosody diversifying module. Our TTS model is capable of generating speech samples that simultaneously consider perceptual diversity in each sample and among multiple samples. We demonstrate that DPP-TTS generates speech samples with more diversified prosody than baselines in the side-by-side comparison test considering the naturalness of speech at the same time.

“…Another controllable aspect is the speaker voice, introduced through additional speaker embeddings extracted through a speaker verification network [13]. Finally, [27] proposed a methodology to control the prominence and boundaries by automatically deriving prosodic tags to augment the input character sequence. It is also possible to combine multiple techniques into a single conditioned architecture, as shown by [26].…”

Section: Related Work (mentioning)

confidence: 99%

ITAcotron 2: The Power of Transfer Learning in Expressive TTS Synthesis

Favaro, Sbattella, Tedesco, et al., 2022

Signals and Communication Technology

A Text-to-Speech (TTS) synthesizer has to generate intelligible and natural speech while modeling linguistic and paralinguistic components characterizing human voice. In this work, we present ITAcotron 2, an Italian TTS synthesizer able to generate speech in several voices. In its development, we explored the power of transfer learning by iteratively fine-tuning an English Tacotron 2 spectrogram predictor on different Italian data sets. Moreover, we introduced a conditioning strategy to enable ITAcotron 2 to generate new speech in the voice of a variety of speakers. To do so, we examined the zero-shot behaviour of a speaker encoder architecture, previously trained to accomplish a speaker verification task with English speakers, to represent Italian speakers' voiceprints. We asked 70 volunteers to evaluate intelligibility, naturalness, and similarity between synthesised voices and real speech from target speakers. Our model achieved a MOS score of 4.15 in intelligibility, 3.32 in naturalness, and 3.45 in speaker similarity. These results showed the successful adaptation of the refined system to the new language and its ability to synthesize novel speech in the voice of several speakers.

“…This is the solution we decided to follow for the VC. Additionally, to further refine speech synthesis, we considered augmenting the text to be pronounced with prosodic clues (predicted by the empathetic controller) [20] useful, for example, for putting the focus on specific words.…”

Section: Output Conditioning Modules (mentioning)

confidence: 99%

A Modular Data-Driven Architecture for Empathetic Conversational Agents

Scotti, Tedesco, Sbattella, 2021

2021 IEEE International Conference on Big Data and Smart Computing (BigComp)

Empathy is a fundamental mechanism of human interactions. As such, it should be an integral part of Human-Computer Interaction systems to make them more relatable. With this work, we focused on conversational scenarios where integrating empathy is crucial to perceive the computer as a human. As a result, we derived the high-level architecture of an Empathetic Conversational Agent we are willing to implement. We relied on theories about artificial empathy to derive the function approximating this mechanism, and selected the conversational aspects to control for an empathetic interaction. In particular, a core empathetic controller manages the empathetic responses, predicting, at each turn, the high-level content of the response. The derived architecture integrates empathy in a task-agnostic manner, hence we can employ it in multiple scenarios by changing the objective of the controller.
