Speech synthesis has reached a reasonably high quality in recent years. However, there is still room for improvement in terms of naturalness and expressiveness when dealing with large multisentential discourse, since most text-to-speech synthesizers do not fully take into account the prosodic differences that have been observed in discourse units such as paragraphs. This work presents an implementation of paragraph-based prosodic patterns in the open-source MARYTTS platform, enriching its prosody output by means of intra- and inter-paragraph prosodic features. The set of characteristics includes pitch decay, pitch range and speech rate variation (as intra-paragraph features), as well as paragraph break pauses and speech rate variation (as inter-paragraph features), previously analyzed in a large set of TED Talks and read-speech sections of the Spoken Wikipedia Corpus. The perception tests, performed with both English and German parametric voices, suggest that paragraph-based features should be further studied and taken into account in future implementations that synthesize large discourse speech.
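To make the idea of intra-paragraph pitch decay concrete, the following is a minimal sketch, not the implementation described in the abstract. It assumes that decay can be approximated by a linear drop in a per-sentence pitch scale factor across the paragraph. The function name `paragraph_pitch_scales` and the default values `start=1.0` and `end=0.9` are hypothetical and chosen only for illustration; they are not taken from the TED Talks or Spoken Wikipedia analyses.

```python
def paragraph_pitch_scales(n_sentences, start=1.0, end=0.9):
    """Return one pitch scale factor per sentence in a paragraph.

    Assumption (illustrative only): pitch decays linearly from `start`
    at the first sentence to `end` at the last sentence.
    """
    if n_sentences < 1:
        raise ValueError("a paragraph needs at least one sentence")
    if n_sentences == 1:
        # A single sentence has no room to decay.
        return [start]
    step = (end - start) / (n_sentences - 1)
    return [start + i * step for i in range(n_sentences)]
```

Each factor could then be multiplied by a voice's baseline pitch for the corresponding sentence; how such targets are passed to a synthesizer depends on the platform and is outside this sketch.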
State-of-the-art end-to-end speech synthesis models have reached levels of quality close to human capabilities. However, there is still room for improvement in terms of naturalness, related to prosody, which is essential for human-machine interaction. Therefore, part of current research has shifted its focus to improving this aspect, with many solutions mainly involving prosody adaptability or control. In this work, we explored a way to include linguistic features in the sequence-to-sequence Tacotron2 system to improve the naturalness of the generated voice, that is, to make the prosody of the synthesis more like that of a real human speaker. Specifically, we used an additional encoder to embed part-of-speech tags and punctuation mark locations of the input text in order to condition Tacotron2 generation. We propose two different architectures for this parallel encoder: one based on a stack of convolutional plus recurrent layers, and another formed by a stack of bidirectional recurrent plus linear layers. To evaluate the similarity between real read speech and synthesis, we carried out an objective test using signal processing metrics and a perceptual test. The presented results show that we achieved an improvement in naturalness.
Text-to-speech (TTS) systems provide fundamental reading support for people with aphasia and reading difficulties. However, artificial voices are more difficult to process than natural voices. The current study is an extended analysis of the results of a clinical experiment investigating which, among three artificial voices and a digitised human voice, is most suitable for people with aphasia and reading impairments. Those results show that the voice synthesised with Ogmios TTS, a concatenative speech synthesis system, caused significantly slower reaction times than the other three voices used in the experiment. The present study explores whether, and which, voice quality metrics are linked to delayed reaction times. For this purpose, the voices were analysed using an automatic assessment of intelligibility and naturalness, together with the jitter and shimmer voice quality parameters. This analysis revealed that Ogmios TTS, in general, performed worse than the other voices in all parameters. These observations could explain the significantly delayed reaction times in people with aphasia and reading impairments when listening to Ogmios TTS, and could inform the choice of TTS for compensatory devices for these patients based on a voice analysis of these parameters.
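For readers unfamiliar with the jitter and shimmer parameters mentioned above, the sketch below shows the commonly used "local" definitions: the mean absolute difference between consecutive glottal cycles, divided by the mean value, applied to cycle periods (jitter) or cycle peak amplitudes (shimmer). It assumes the periods and amplitudes have already been extracted from the signal. The abstract does not state which variants or tools the study used, so the function names here are illustrative only.

```python
from statistics import mean


def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to the mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


def local_jitter(periods_s):
    """Cycle-to-cycle variation of the period (seconds), as a ratio."""
    return local_perturbation(periods_s)


def local_shimmer(peak_amplitudes):
    """Cycle-to-cycle variation of the peak amplitude, as a ratio."""
    return local_perturbation(peak_amplitudes)
```

A perfectly periodic voice gives 0 for both measures; larger values indicate more cycle-to-cycle irregularity.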