Interspeech 2019
DOI: 10.21437/interspeech.2019-1824

Investigating the Robustness of Sequence-to-Sequence Text-to-Speech Models to Imperfectly-Transcribed Training Data

Abstract: Sequence-to-sequence (S2S) text-to-speech (TTS) models can synthesise high quality speech when large amounts of annotated training data are available. Transcription errors exist in all data and are especially prevalent in found data such as audiobooks. In previous generations of TTS technology, alignment using Hidden Markov Models (HMMs) was widely used to identify and eliminate bad data. In S2S models, the use of attention replaces HMM-based alignment, and there is no explicit mechanism for removing bad data.…

Cited by 11 publications (7 citation statements); references 13 publications.

“…In the context of G2P, a word error occurs when at least one phone in the word is incorrectly predicted. Further details regarding the system architecture and hyperparameter setup can be found in [17].…”
Section: TTS Model Architecture
Confidence: 99%
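The word-error criterion quoted above has a direct computational reading: a word's pronunciation counts as wrong if any predicted phone differs from the reference. A minimal sketch (the function name and ARPAbet-style phone labels are illustrative, not taken from the cited paper):

```python
def g2p_word_error_rate(ref_prons, hyp_prons):
    """Word error rate for G2P: a word is an error if at least one
    phone in its predicted pronunciation differs from the reference.

    ref_prons, hyp_prons: lists of phone sequences, one per word.
    """
    assert len(ref_prons) == len(hyp_prons)
    errors = sum(1 for ref, hyp in zip(ref_prons, hyp_prons) if ref != hyp)
    return errors / len(ref_prons)

refs = [["K", "AE", "T"], ["D", "AO", "G"]]
hyps = [["K", "AE", "T"], ["D", "AA", "G"]]  # one phone wrong in the second word
print(g2p_word_error_rate(refs, hyps))  # 0.5
```

Note that under this criterion a single mispredicted phone is as costly as an entirely wrong pronunciation, which makes word-level error stricter than phone-level error.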
“…Relatedly, pronunciation errors may also derive from a lack of deeper linguistic knowledge learned implicitly from text-audio pairs in the dataset. Increasingly, research demonstrates that augmenting E2E-TTS with linguistic features improves quality in English, such as with phones [13] or with morphemes [14]. Pronunciation correction is also possible when mixing input representations between graphemes, phones and syllables [15,16].…”
Section: Linguistic Features in Tacotron
Confidence: 99%
“…Previously, only closed-vocabulary ASR had been used for transcription tasks, as in [17]. Recently, ASR has also been used for other tasks in TTS, such as the automatic selection of “clean” training utterances and speakers [18], and for the transcription of training recordings in [19].…”
Section: Introduction
Confidence: 99%