Recent Advances in Google Real-Time HMM-Driven Unit Selection Synthesizer

Gonzalvo, Xavi; Tazari, Siamak; Chan, Chin-Feng; Becker, Markus C.; Gutkin, Alexander; Silén, Hanna

doi:10.21437/interspeech.2016-264

Cited by 51 publications

(32 citation statements)

References 12 publications


“…In order to better isolate the effect of using mel spectrograms as features, we compare to a WaveNet conditioned on linguistic features [8] with similar modifications to the WaveNet architecture as introduced above. We also compare to the original Tacotron that predicts linear spectrograms and uses Griffin-Lim to synthesize audio, as well as concatenative [30] and parametric [31] baseline systems, both of which have been used in production at Google. We find that the proposed system significantly outperforms all other TTS systems, and results in an MOS comparable to that of the ground truth audio.…”

Section: Discussion (mentioning)

confidence: 99%

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Shen, Pang, Weiss et al. 2018

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

2,046

1,821


This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the conditioning input to WaveNet instead of linguistic, duration, and F0 features. We further show that using this compact acoustic intermediate representation allows for a significant reduction in the size of the WaveNet architecture.


“…When computing MOS, we only include ratings where headphones were used. We compare our model with a parametric (based on LSTM) and a concatenative system (Gonzalvo et al., 2016), both of which are in production. As shown in Table 2, Tacotron achieves an MOS of 3.82, which outperforms the parametric system.…”

Section: Mean Opinion Score Tests (mentioning)

confidence: 99%

Tacotron: Towards End-to-End Speech Synthesis

et al. 2017


A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.


“…To evaluate the performance of contextual biasing, we report performance on a contacts test set, which consists of requests to call/text contacts. This set is created by mining contact names from the web, and synthesizing TTS utterances in each of these categories using a concatenative TTS approach with one voice [25]. Noise is then artificially added to the TTS data, similar to the process described above [24].…”

Section: Data Sets (mentioning)

confidence: 99%

Two-Pass End-to-End Speech Recognition

C, Pang, Rybach et al. 2019

Interspeech 2019

111


The requirements for many applications of state-of-the-art speech recognition systems include not only low word error rate (WER) but also low latency. Specifically, for many use-cases, the system must be able to decode utterances in a streaming fashion and faster than real-time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has shown to be a good candidate for on-device speech recognition, with improved WER and latency metrics compared to conventional on-device models [1]. However, this model still lags behind a large state-of-the-art conventional model in quality [2]. On the other hand, a non-streaming E2E Listen, Attend and Spell (LAS) model has shown comparable quality to large conventional models [3]. This work aims to bring the quality of an E2E streaming model closer to that of a conventional system by incorporating a LAS network as a second-pass component, while still abiding by latency constraints. Our proposed two-pass model achieves a 17%-22% relative reduction in WER compared to RNN-T alone and increases latency by a small fraction over RNN-T.
