2019
DOI: 10.1609/aaai.v33i01.33016706

Neural Speech Synthesis with Transformer Network

Abstract: Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace the RNN structures and also the original attention…
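As context for the approach summarized above, the sketch below shows multi-head self-attention, the mechanism that replaces the recurrent structures. It is a minimal illustration using PyTorch's built-in layer; the model width, head count, batch size, and sequence length are assumed values, not settings taken from the paper.

```python
import torch
import torch.nn as nn

# Minimal sketch: multi-head self-attention over a hidden sequence, standing in
# for the recurrent encoder it replaces. All sizes below are assumptions.
d_model, n_heads, seq_len, batch = 512, 8, 100, 2

attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)
x = torch.randn(batch, seq_len, d_model)   # hidden states for one batch
out, weights = attn(x, x, x)               # self-attention: query = key = value
print(out.shape, weights.shape)            # (2, 100, 512), (2, 100, 100)
```

Because self-attention connects every position to every other position in a single step, it sidesteps the long-dependency problem the abstract attributes to RNNs.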

Cited by 589 publications (437 citation statements)
References 5 publications
“…The input of the encoder in TTS is a sequence of IDs corresponding to the input characters and the EOS symbol. First, the character ID sequence is converted into a sequence of character vectors with an embedding layer, and then the positional encoding scaled by a learnable scalar parameter is added to the vectors [4]. This process is a TTS implementation of EncPre(·) in Eq.…”
Section: TTS Encoder Architecture
confidence: 99%
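A minimal sketch of the encoder pre-processing step described in this excerpt: an embedding lookup followed by a sinusoidal positional encoding scaled by a learnable scalar. The vocabulary size, model dimension, and batch shapes are assumptions for illustration, not values from the cited implementation.

```python
import math
import torch
import torch.nn as nn

class ScaledPositionalEncoding(nn.Module):
    """Sinusoidal positional encoding multiplied by a learnable scalar (assumed form)."""
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        self.alpha = nn.Parameter(torch.ones(1))  # learnable scale on the positional encoding

    def forward(self, x):                         # x: (batch, time, d_model)
        return x + self.alpha * self.pe[: x.size(1)]

# Character IDs (including an assumed EOS id) -> embeddings -> scaled positional encoding.
vocab_size, d_model = 128, 512                    # assumed sizes
embed = nn.Embedding(vocab_size, d_model)
enc_pre = ScaledPositionalEncoding(d_model)
char_ids = torch.randint(0, vocab_size, (2, 50))  # dummy batch of character sequences
h = enc_pre(embed(char_ids))                      # (2, 50, 512)
```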
“…This network consists of two linear layers with 256 units, a ReLU activation function, and dropout, followed by a projection linear layer with d att units. Since the hidden representations produced by the Prenet are expected to lie in a feature space similar to that of the encoder features, the Prenet helps to learn a diagonal encoder-decoder attention [4]. Then the decoder DecBody(·) in Eq.…”
Section: TTS Decoder Architecture
confidence: 99%
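A sketch of a decoder Prenet matching this description (two 256-unit linear layers with ReLU and dropout, then a projection to d_att units). The mel-spectrogram dimension, d_att value, and dropout rate below are assumptions.

```python
import torch
import torch.nn as nn

class Prenet(nn.Module):
    """Decoder pre-net as described in the excerpt: two 256-unit linear layers with
    ReLU and dropout, followed by a linear projection to the attention dimension."""
    def __init__(self, in_dim: int = 80, hidden: int = 256, d_att: int = 512, p: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p),
        )
        self.proj = nn.Linear(hidden, d_att)  # projects into the encoder feature space

    def forward(self, mel_frames):            # (batch, time, in_dim)
        return self.proj(self.net(mel_frames))

prenet = Prenet()
y = prenet(torch.randn(2, 120, 80))           # -> (2, 120, 512)
```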
“…In particular, the training process becomes 4.82 times faster (from 13.5 days to 2.8 days with two NVIDIA Tesla V100 GPUs) and the inference process becomes 1.96 times faster (from 14.62 to 28.68 times real time to generate 24 kHz speech waveforms with a single NVIDIA Tesla V100 GPU) compared with the conventional ClariNet model. • We combined the proposed Parallel WaveGAN with a TTS acoustic model based on a Transformer [15][16][17]. Perceptual listening tests verify that the proposed Parallel WaveGAN achieves 4.16 MOS, which is competitive with the best distillation-based ClariNet model.…”
Section: Introduction
confidence: 99%
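The quoted speed-up factors follow directly from the raw numbers in the excerpt; a quick arithmetic check:

```python
# Speed-up factors recomputed from the values quoted above.
train_days_before, train_days_after = 13.5, 2.8
rtf_before, rtf_after = 14.62, 28.68               # real-time generation factors

print(round(train_days_before / train_days_after, 2))  # ~4.82x faster training
print(round(rtf_after / rtf_before, 2))                # ~1.96x faster inference
```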
“…The original GST method uses an LSTM-based Tacotron 2 [1] as the TTS backbone and an LSTM encoder for computing the style coefficients. For training efficiency and a fair comparison, in our implementation of GST we use Transformer TTS [3] for the content encoder and the decoder, and replace the LSTM with max-pooling for computing the style coefficients. We refer to our implementation of this method as GST*.…”
Section: Methods
confidence: 99%
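A hypothetical sketch of the max-pooling style encoder this excerpt describes: reference features are max-pooled over time (replacing the LSTM summary) and mapped to softmax weights, the style coefficients, over a bank of learned style tokens. The token count, dimensions, and weighted-sum readout are assumptions, not details from the cited paper.

```python
import torch
import torch.nn as nn

class MaxPoolStyleEncoder(nn.Module):
    """Hypothetical sketch: summarize reference features by max-pooling over time,
    then map the summary to softmax weights over a bank of learned style tokens."""
    def __init__(self, d_ref: int = 512, n_tokens: int = 10, d_style: int = 256):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, d_style))  # learned style tokens
        self.to_logits = nn.Linear(d_ref, n_tokens)

    def forward(self, ref_feats):                 # (batch, time, d_ref)
        pooled = ref_feats.max(dim=1).values      # max-pooling over time replaces the LSTM
        coeffs = torch.softmax(self.to_logits(pooled), dim=-1)  # style coefficients
        style = coeffs @ self.tokens              # weighted sum of style tokens
        return style, coeffs

style, coeffs = MaxPoolStyleEncoder()(torch.randn(2, 120, 512))  # (2, 256), (2, 10)
```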