2019
DOI: 10.1109/access.2019.2914149

Pre-Alignment Guided Attention for Improving Training Efficiency and Model Stability in End-to-End Speech Synthesis

Abstract: Recently, end-to-end (E2E) neural text-to-speech systems, such as Tacotron2, have begun to surpass traditional multi-stage hand-engineered systems, offering both a simplified system-building pipeline and high-quality speech. With its unified encoder-decoder neural structure, the Tacotron2 system no longer needs a separately learned text-analysis front-end, duration model, acoustic model, or audio-synthesis module. The key to such a system lies in the attention mechanism, which learns an alignment between the encode…
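The alignment the abstract refers to can be pictured as a matrix of attention weights linking decoder steps to encoder steps. Below is a minimal, self-contained sketch of how such a soft alignment is computed, assuming plain dot-product attention and arbitrary dimensions; Tacotron2 itself uses a location-sensitive variant, and all shapes and names here are illustrative assumptions, not the paper's method.

import numpy as np

def soft_alignment(encoder_states, decoder_states):
    """Return a soft alignment matrix of shape (T_dec, T_enc).

    encoder_states: (T_enc, d) outputs of the text encoder.
    decoder_states: (T_dec, d) query states of the decoder.
    Each row is a distribution over encoder steps (attention weights).
    """
    scores = decoder_states @ encoder_states.T           # (T_dec, T_enc)
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)  # softmax per row

# Example: 5 encoder steps, 3 decoder steps, 8-dim states.
rng = np.random.default_rng(0)
A = soft_alignment(rng.normal(size=(5, 8)), rng.normal(size=(3, 8)))
print(A.shape, A.sum(axis=1))  # (3, 5); each row sums to 1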

Cited by 30 publications (21 citation statements)
References 7 publications
“…1. Both the GA and Force-Align models perform much better than the baseline, which is consistent with the findings in [18] and [19]. The MAGT model receives the most preferences compared with the other models, which indicates the advantage of multi-alignment guided attention.…”
Section: Evaluation of the Training Strategies (supporting)
confidence: 73%
“…Specifically, two extra alignment terms are introduced to penalize the attention learning if the learned alignment does not match the guided alignments. Formally, we define two matrices A*, A to represent the two guided alignments, where A* is obtained beforehand by force alignment using ASR [19], and A is created under the assumption that the alignment between the input text sequence and the output sequence should be "nearly diagonal" [18]. That is, A is defined as follows:…”
Section: Multi-Alignment Guided Attention (mentioning)
confidence: 99%
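For context, the "nearly diagonal" guide attributed to [18] (Tachibana et al.) is commonly written as W[n, t] = 1 − exp(−(n/N − t/T)² / (2g²)): near zero on the diagonal and approaching one away from it, so multiplying it elementwise with the learned attention penalizes off-diagonal mass. Below is a minimal sketch of that matrix and its penalty, assuming g = 0.2 as in [18]; the force-aligned matrix A* from [19] would come from an external ASR aligner and is not reproduced here.

import numpy as np

def diagonal_guide(T_enc, T_dec, g=0.2):
    """W[n, t] = 1 - exp(-((n/T_enc - t/T_dec)^2) / (2 g^2)).

    W is 0 on the (normalized) diagonal and grows toward 1 away from
    it, so the penalty below is small when attention stays diagonal.
    """
    n = np.arange(T_enc)[:, None] / T_enc
    t = np.arange(T_dec)[None, :] / T_dec
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

def guided_attention_penalty(attention, guide):
    """Mean elementwise product: penalizes off-diagonal attention mass."""
    return float(np.mean(attention * guide))

# Example: a perfectly diagonal alignment incurs zero penalty.
T = 50
W = diagonal_guide(T, T)
print(guided_attention_penalty(np.eye(T), W))  # 0.0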
“…We conduct experiments investigating the robustness of other attention mechanisms, including forward attention (Zhang, Ling, and Dai 2018), GMM attention (Graves 2013), the forced monotonic mechanism (Raffel et al. 2017), and guided attention (Zhu et al. 2019). All of these mechanisms generate bad cases, so none of them could be part of our robust model.…”
Section: Other Attention Mechanisms (mentioning)
confidence: 99%
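Of the mechanisms listed, GMM attention (Graves 2013) has the most self-contained definition: attention weights come from a mixture of Gaussians over encoder positions whose means can only move forward, which makes the alignment monotonic by construction. A rough sketch of a single decoder step follows, with the component count and parameter names assumed for illustration rather than taken from any of the cited papers.

import numpy as np

def gmm_attention_step(params, kappa_prev, T_enc):
    """One decoder step of Graves-style GMM attention.

    params: (3, K) unnormalized mixture parameters
            (alpha_hat, beta_hat, kappa_hat).
    kappa_prev: (K,) previous mixture means.
    Returns (weights over T_enc encoder positions, updated kappa).
    """
    alpha = np.exp(params[0])               # (K,) mixture weights
    beta = np.exp(params[1])                # (K,) inverse widths
    kappa = kappa_prev + np.exp(params[2])  # means only move forward
    u = np.arange(T_enc)[:, None]           # (T_enc, 1) encoder positions
    phi = np.sum(alpha * np.exp(-beta * (kappa - u) ** 2), axis=1)
    return phi, kappa

# Example: 3 mixture components over 20 encoder positions.
rng = np.random.default_rng(1)
phi, kappa = gmm_attention_step(rng.normal(size=(3, 3)), np.zeros(3), 20)
print(phi.shape, kappa)  # (20,) weights; means are strictly positive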
“…Therefore, even if the same sentence is repeated continuously during the same period of time, it is impossible for the collected audio signal to be completely similar every time. However, in related robot applications, most employ fixed text-to-speech (TTS) technology 26,27 to enable users to understand the robots' expressions by listening. Although TTS technology is increasingly mature and users can easily understand its content, the audio signals generated by TTS usually exhibit relatively flat intonation, and the same audio signal is generated each time it is played.…”
Section: Concept of the Proposed Voice Generator (mentioning)
confidence: 99%