2019
DOI: 10.1609/aaai.v33i01.33016706

Neural Speech Synthesis with Transformer Network

Abstract: Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace the RNN structures and also the original attention…
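As context for the approach summarized above, the sketch below shows multi-head self-attention, the mechanism that replaces the recurrent structures. It is a minimal illustration using PyTorch's built-in layer; the model width, head count, batch size, and sequence length are assumed values, not settings taken from the paper.

```python
import torch
import torch.nn as nn

# Minimal sketch: multi-head self-attention over a hidden sequence, standing in
# for the recurrent encoder it replaces. All sizes below are assumptions.
d_model, n_heads, seq_len, batch = 512, 8, 100, 2

attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)
x = torch.randn(batch, seq_len, d_model)   # hidden states for one batch
out, weights = attn(x, x, x)               # self-attention: query = key = value
print(out.shape, weights.shape)            # (2, 100, 512), (2, 100, 100)
```

Because self-attention connects every position to every other position in a single step, it sidesteps the long-dependency problem the abstract attributes to RNNs.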

Cited by 589 publications (437 citation statements)
References 5 publications
“…The input of the encoder in TTS is a sequence of IDs corresponding to the input characters and the EOS symbol. First, the character ID sequence is converted into a sequence of character vectors with an embedding layer, and then the positional encoding scaled by a learnable scalar parameter is added to the vectors [4]. This process is a TTS implementation of EncPre(·) in Eq.…”
Section: TTS Encoder Architecture
confidence: 99%
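A minimal sketch of the encoder pre-processing step described in this excerpt: an embedding lookup followed by a sinusoidal positional encoding scaled by a learnable scalar. The vocabulary size, model dimension, and batch shapes are assumptions for illustration, not values from the cited implementation.

```python
import math
import torch
import torch.nn as nn

class ScaledPositionalEncoding(nn.Module):
    """Sinusoidal positional encoding multiplied by a learnable scalar (assumed form)."""
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        self.alpha = nn.Parameter(torch.ones(1))  # learnable scale on the positional encoding

    def forward(self, x):                         # x: (batch, time, d_model)
        return x + self.alpha * self.pe[: x.size(1)]

# Character IDs (including an assumed EOS id) -> embeddings -> scaled positional encoding.
vocab_size, d_model = 128, 512                    # assumed sizes
embed = nn.Embedding(vocab_size, d_model)
enc_pre = ScaledPositionalEncoding(d_model)
char_ids = torch.randint(0, vocab_size, (2, 50))  # dummy batch of character sequences
h = enc_pre(embed(char_ids))                      # (2, 50, 512)
```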
“…This network consists of two linear layers with 256 units, a ReLU activation function, and dropout, followed by a projection linear layer with d att units. Since the hidden representations produced by the Prenet are expected to lie in a feature space similar to that of the encoder features, the Prenet helps to learn a diagonal encoder-decoder attention [4]. Then the decoder DecBody(·) in Eq.…”
Section: TTS Decoder Architecture
confidence: 99%
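A sketch of a decoder Prenet matching this description (two 256-unit linear layers with ReLU and dropout, then a projection to d_att units). The mel-spectrogram dimension, d_att value, and dropout rate below are assumptions.

```python
import torch
import torch.nn as nn

class Prenet(nn.Module):
    """Decoder pre-net as described in the excerpt: two 256-unit linear layers with
    ReLU and dropout, followed by a linear projection to the attention dimension."""
    def __init__(self, in_dim: int = 80, hidden: int = 256, d_att: int = 512, p: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p),
        )
        self.proj = nn.Linear(hidden, d_att)  # projects into the encoder feature space

    def forward(self, mel_frames):            # (batch, time, in_dim)
        return self.proj(self.net(mel_frames))

prenet = Prenet()
y = prenet(torch.randn(2, 120, 80))           # -> (2, 120, 512)
```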
“…In particular, the training process becomes 4.82 times faster (from 13.5 days to 2.8 days with two NVIDIA Tesla V100 GPUs) and the inference process becomes 1.96 times faster (from 14.62 to 28.68 times real time to generate 24 kHz speech waveforms with a single NVIDIA Tesla V100 GPU) compared with the conventional ClariNet model. • We combined the proposed Parallel WaveGAN with a TTS acoustic model based on a Transformer [15][16][17]. Perceptual listening tests verify that the proposed Parallel WaveGAN achieves 4.16 MOS, which is competitive with the best distillation-based ClariNet model.…”
Section: Introduction
confidence: 99%
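The quoted speed-up factors follow directly from the raw numbers in the excerpt; a quick arithmetic check:

```python
# Speed-up factors recomputed from the values quoted above.
train_days_before, train_days_after = 13.5, 2.8
rtf_before, rtf_after = 14.62, 28.68               # real-time generation factors

print(round(train_days_before / train_days_after, 2))  # ~4.82x faster training
print(round(rtf_after / rtf_before, 2))                # ~1.96x faster inference
```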
“…The original GST method uses an LSTM-based Tacotron 2 [1] as the TTS backbone and an LSTM encoder for computing the style coefficients. For training efficiency and a fair comparison, in our implementation of GST we use Transformer TTS [3] for the content encoder and the decoder, and replace the LSTM with max-pooling for computing the style coefficients. We refer to our implementation of this method as GST*.…”
Section: Methods
confidence: 99%
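A hypothetical sketch of the max-pooling style encoder this excerpt describes: reference features are max-pooled over time (replacing the LSTM summary) and mapped to softmax weights, the style coefficients, over a bank of learned style tokens. The token count, dimensions, and weighted-sum readout are assumptions, not details from the cited paper.

```python
import torch
import torch.nn as nn

class MaxPoolStyleEncoder(nn.Module):
    """Hypothetical sketch: summarize reference features by max-pooling over time,
    then map the summary to softmax weights over a bank of learned style tokens."""
    def __init__(self, d_ref: int = 512, n_tokens: int = 10, d_style: int = 256):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, d_style))  # learned style tokens
        self.to_logits = nn.Linear(d_ref, n_tokens)

    def forward(self, ref_feats):                 # (batch, time, d_ref)
        pooled = ref_feats.max(dim=1).values      # max-pooling over time replaces the LSTM
        coeffs = torch.softmax(self.to_logits(pooled), dim=-1)  # style coefficients
        style = coeffs @ self.tokens              # weighted sum of style tokens
        return style, coeffs

style, coeffs = MaxPoolStyleEncoder()(torch.randn(2, 120, 512))  # (2, 256), (2, 10)
```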