Interspeech 2021
DOI: 10.21437/interspeech.2021-1655
Transformer-Based Acoustic Modeling for Streaming Speech Synthesis

Cited by 6 publications (3 citation statements)
References 12 publications
“…A temporal network is modeled using the transformer architecture introduced in [27] to process the multi-feature fusion vectors generated in the previous step. Due to its capacity to capture global contextual information and its parallelization capabilities [28-31], the transformer model is widely employed in natural language processing and computer vision. Therefore, we employ the transformer model as one of the network's modules for extracting temporal information.…”
Section: Feature Extraction
confidence: 99%
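As a rough illustration of the setup this statement describes, the PyTorch sketch below runs a standard transformer encoder over a sequence of fused feature vectors. The class name, layer count, and dimensions are illustrative assumptions, not taken from the cited work.

```python
# Illustrative sketch only; names and dimensions are assumptions.
import torch
import torch.nn as nn

class TemporalTransformer(nn.Module):
    """Transformer encoder applied to a sequence of multi-feature fusion vectors."""
    def __init__(self, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, fused):            # fused: (batch, time, d_model)
        return self.encoder(fused)       # same shape, each step now carries global context

x = torch.randn(8, 100, 256)             # 8 sequences of 100 fusion vectors
y = TemporalTransformer()(x)
print(y.shape)                           # torch.Size([8, 100, 256])
```

Self-attention lets every time step attend to every other step in one pass, which is what gives the temporal module its global context and parallelism compared with a recurrent layer.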
“…The frontend takes plain text as input and emits its phonetic representation (grapheme to phoneme) and additional prosodic information for each token in the output sequence. These are fed into the acoustic backend, which consists of a transformer-based prosody model that predicts phone-level F0 values and durations, a transformer-based spectral model that produces frame-level mel-cepstral coefficients, F0 and periodicity features, and finally a sparse WaveRNN-based neural vocoder that predicts the final waveform [20]. The acoustic model is trained on a multi-speaker dataset consisting of 170 hours of TTS-quality audio across 96 English speakers, while the WaveRNN vocoder is trained for each speaker separately.…”
Section: TTS
confidence: 99%
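The backend described in that statement follows a frontend → prosody model → spectral model → vocoder chain. The PyTorch sketch below only mirrors that data flow: all module internals, class names, and feature sizes are assumptions for illustration, they do not reproduce the cited implementation, and the WaveRNN vocoder stage is omitted.

```python
# Illustrative sketch of the described data flow; internals are assumptions.
import torch
import torch.nn as nn

def _encoder(d_model=256, nhead=4, num_layers=2):
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

class AcousticBackendSketch(nn.Module):
    """Prosody model (phone-level F0 + duration) followed by a spectral model
    (frame-level mel-cepstra, F0, periodicity). A vocoder such as WaveRNN would
    consume the frame-level output; it is not included here."""
    def __init__(self, n_phones=80, d_model=256, n_mcep=24):
        super().__init__()
        self.embed = nn.Embedding(n_phones, d_model)
        self.prosody_model = _encoder(d_model)
        self.prosody_head = nn.Linear(d_model, 2)            # per phone: [F0, duration in frames]
        self.spectral_model = _encoder(d_model)
        self.spectral_head = nn.Linear(d_model, n_mcep + 2)  # per frame: mcep + F0 + periodicity

    def forward(self, phones):                               # phones: (batch, T_phone) int ids
        h = self.embed(phones)
        prosody = self.prosody_head(self.prosody_model(h))   # (batch, T_phone, 2)
        durations = prosody[..., 1].clamp(min=1).round().long()
        # Upsample phone-level states to frame level by repeating each phone
        # state for its predicted number of frames (per-utterance, simplified).
        frames = [torch.repeat_interleave(h[b], durations[b], dim=0)
                  for b in range(h.size(0))]
        frames = nn.utils.rnn.pad_sequence(frames, batch_first=True)
        acoustic = self.spectral_head(self.spectral_model(frames))
        return prosody, acoustic

phones = torch.randint(0, 80, (2, 12))                       # two utterances of 12 phones
prosody, acoustic = AcousticBackendSketch()(phones)
print(prosody.shape, acoustic.shape)
```

The key structural point the quote makes is the split of the backend into a phone-level prosody predictor, a frame-level spectral predictor, and a separate neural vocoder; the duration-driven upsampling step above is the bridge between the phone and frame time scales.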
“…Such neural networks have found application in all major deep learning tasks, from natural language processing [15] to automatic image segmentation [16,17], gradually displacing previous generations of deep neural networks based on convolutional transforms and recurrent layers. In speech processing, the self-attention transform has been applied to automatic speech recognition [18], emotion recognition [19], and speech synthesis [20]. In the speech enhancement task, this mechanism has been used as a replacement for part of the recurrent and convolutional transforms [9].…”
Section: Problem Statement