Interspeech 2016
DOI: 10.21437/interspeech.2016-264

Recent Advances in Google Real-Time HMM-Driven Unit Selection Synthesizer

Abstract: This paper presents advances in Google's hidden Markov model (HMM)-driven unit selection speech synthesis system. We describe several improvements to the run-time system; these include minimal latency, high quality, and a fast refresh cycle for new voices. Traditionally, unit selection synthesizers are limited in terms of the amount of data they can handle and the real applications they are built for. That is even more critical for real-life large-scale applications where high quality is expected and low latency is…

Cited by 51 publications (32 citation statements)
References 12 publications
“…In order to better isolate the effect of using mel spectrograms as features, we compare to a WaveNet conditioned on linguistic features [8] with similar modifications to the WaveNet architecture as introduced above. We also compare to the original Tacotron that predicts linear spectrograms and uses Griffin-Lim to synthesize audio, as well as concatenative [30] and parametric [31] baseline systems, both of which have been used in production at Google. We find that the proposed system significantly outperforms all other TTS systems, and results in an MOS comparable to that of the ground truth audio.…”
Section: Discussion (mentioning)
confidence: 99%
“…When computing MOS, we only include ratings where headphones were used. We compare our model with a parametric (based on LSTM) and a concatenative system (Gonzalvo et al., 2016), both of which are in production. As shown in Table 2, Tacotron achieves an MOS of 3.82, which outperforms the parametric system.…”
Section: Mean Opinion Score Tests (mentioning)
confidence: 99%
“…To evaluate the performance of contextual biasing, we report performance on a contacts test set, which consists of requests to call/text contacts. This set is created by mining contact names from the web, and synthesizing TTS utterances in each of these categories using a concatenative TTS approach with one voice [25]. Noise is then artificially added to the TTS data, similar to the process described above [24].…”
Section: Data Sets (mentioning)
confidence: 99%