Rob Clark scite author profile

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.

Tacotron: Towards End-to-End Speech Synthesis

Wang

Skerry-Ryan

Stanton

et al. 2017

Preprint

215

279

This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech use. It is derived from the original audio and text materials of the LibriSpeech corpus, which has been used for training and evaluating automatic speech recognition systems. The new corpus inherits desired properties of the LibriSpeech corpus while addressing a number of issues which make LibriSpeech less than ideal for text-to-speech work. The released corpus consists of 585 hours of speech data at 24 kHz sampling rate from 2,456 speakers and the corresponding texts. Experimental results show that neural end-to-end TTS models trained from the LibriTTS corpus achieved above 4.0 in mean opinion scores in naturalness in five out of six evaluation speakers. The corpus is freely available for download from http://www.openslr.org/60/.

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

Zen

Dang

Clark

et al. 2019

Preprint

ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech

Wang

Yamagishi

Todisco

et al. 2020

Computer Speech & Language

266

135
