Interspeech 2019
DOI: 10.21437/interspeech.2019-1972

Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS

Abstract: Neural TTS has demonstrated strong capabilities to generate human-like speech with high quality and naturalness, while its generalization to out-of-domain texts is still a challenging task, with regard to the design of attention-based sequence-to-sequence acoustic modeling. Various errors occur in those inputs with unseen context, including attention collapse, skipping, repeating, etc., which limits the broader applications. In this paper, we propose a novel stepwise monotonic attention method in sequence-to-se…

Cited by 67 publications (58 citation statements)
References 12 publications
“…The general form of this type of attention is shown in (5), where w_i, Z_i, Δ_i, and σ_i are computed from the attention RNN state. The mean of each Gaussian component is computed using the recurrence relation in (6), which makes the mechanism location-relative and potentially monotonic if Δ_i is constrained to be positive.…”
Section: GMM-based Mechanisms (mentioning)
confidence: 99%
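
This recurrence is easier to follow in code than in prose, so here is a minimal PyTorch sketch of one GMM-based, location-relative attention step under assumed shapes and names (GMMAttention, param_net, and the argument names are illustrative, not the cited paper's code): w_i, Δ_i, and σ_i are predicted from the attention-RNN state, and each component mean follows the recurrence μ_i = μ_{i−1} + Δ_i, which stays monotonic because Δ_i is kept positive with a softplus.

```python
import math
import torch
import torch.nn.functional as F

class GMMAttention(torch.nn.Module):
    """Sketch of a single GMM-based, location-relative attention step."""

    def __init__(self, rnn_dim: int, num_mixtures: int = 5):
        super().__init__()
        # One projection predicts (w_i, Delta_i, sigma_i) for every component.
        self.param_net = torch.nn.Linear(rnn_dim, 3 * num_mixtures)

    def forward(self, rnn_state, prev_mu, positions):
        # rnn_state: [B, rnn_dim] attention-RNN state
        # prev_mu:   [B, K] component means from the previous decoder step
        # positions: [B, T] encoder time indices 0..T-1
        w, delta, sigma = self.param_net(rnn_state).chunk(3, dim=-1)
        w = F.softmax(w, dim=-1)          # mixture weights (normalization Z_i)
        delta = F.softplus(delta)         # Delta_i > 0 => monotonic movement
        sigma = F.softplus(sigma) + 1e-5  # positive component widths
        mu = prev_mu + delta              # recurrence: mu_i = mu_{i-1} + Delta_i
        # Evaluate each Gaussian at every encoder position and mix.
        z = (positions.unsqueeze(-1) - mu.unsqueeze(1)) / sigma.unsqueeze(1)
        pdf = torch.exp(-0.5 * z * z) / (sigma.unsqueeze(1) * math.sqrt(2 * math.pi))
        align = (w.unsqueeze(1) * pdf).sum(dim=-1)  # [B, T] alignment weights
        return align, mu
```
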
“…Approaches based on the seminal Tacotron system [3] have demonstrated naturalness that rivals that of human speech for certain domains [4]. Despite these successes, there are sometimes complaints of a lack of robustness in the alignment procedure that leads to missing or repeating words, incomplete synthesis, or an inability to generalize to longer utterances [5,6,7]. The original Tacotron system [3] used the content-based attention mechanism introduced in [2] to align the target text with the output spectrogram.…”
Section: Introduction (mentioning)
confidence: 99%
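
For contrast with the location-relative mechanism above, here is a minimal sketch of purely content-based (additive) attention of the kind the original Tacotron used; the class and argument names are assumptions, not Tacotron's actual code. Because the scores depend only on content similarity between decoder state and encoder outputs, nothing constrains the alignment from skipping or regressing, which is the robustness gap the cited works address.

```python
import torch
import torch.nn.functional as F

class ContentAttention(torch.nn.Module):
    """Sketch of additive, content-based attention (Bahdanau-style)."""

    def __init__(self, query_dim: int, memory_dim: int, attn_dim: int):
        super().__init__()
        self.query_layer = torch.nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_layer = torch.nn.Linear(memory_dim, attn_dim, bias=False)
        self.v = torch.nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory):
        # query:  [B, query_dim] current decoder state
        # memory: [B, T, memory_dim] encoder outputs (the text encoding)
        scores = self.v(torch.tanh(
            self.query_layer(query).unsqueeze(1) + self.memory_layer(memory)
        )).squeeze(-1)                          # [B, T] unnormalized energies
        alignments = F.softmax(scores, dim=-1)  # no monotonicity constraint
        context = torch.bmm(alignments.unsqueeze(1), memory).squeeze(1)
        return context, alignments
```
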
“…Many soft-attention mechanisms have been proposed to stabilize alignments to be monotonic (e.g. [5,6]). Hard monotonic alignment is an alternative alignment method to avoid fatal alignment errors [7].…”
Section: Introduction (mentioning)
confidence: 99%
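
To make the monotonic-alignment idea concrete, here is a minimal sketch of one common soft (expected) form of a stepwise monotonic recurrence, with assumed tensor shapes: at each decoder step, the alignment mass at encoder position j either stays there (probability p) or advances by exactly one position, so skipping and regression are impossible by construction.

```python
import torch
import torch.nn.functional as F

def stepwise_monotonic_step(prev_alpha, p_stay):
    # prev_alpha: [B, T] alignment from the previous decoder step
    # p_stay:     [B, T] per-position "stay" probability, e.g. sigmoid(energy)
    # alpha[j] = prev_alpha[j] * p_stay[j] + prev_alpha[j-1] * (1 - p_stay[j-1]):
    # mass either remains at j or moves forward from j-1, never further.
    moved = F.pad((prev_alpha * (1.0 - p_stay))[:, :-1], (1, 0))  # shift right
    return prev_alpha * p_stay + moved
```
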
“…The use of scheduled sampling comes with negative effects that include misalignment between the natural speech frames and the predicted frames due to the fact that the temporal dependency of the acoustic sequence is disrupted. The techniques to improve out-of-domain performance include the GAN-based TTS framework [28] that introduces both real and generated data sequences in discriminator training, and more recently, stepwise monotonic attention for neural TTS [8].…”
Section: Introduction (mentioning)
confidence: 99%
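
As a concrete illustration of the scheduled-sampling trade-off described above, here is a minimal sketch of an autoregressive decoding loop; decoder_cell, frames, and sampling_prob are hypothetical names, not any specific system's API. Feeding back the model's own prediction with some probability breaks the strict ground-truth temporal dependency, which is the source of the misalignment the citation mentions.

```python
import random
import torch

def decode_with_scheduled_sampling(decoder_cell, frames, sampling_prob):
    # frames: [T, B, n_mel] ground-truth acoustic frames (teacher signal)
    # decoder_cell(prev_frame, state) -> (predicted_frame, new_state)
    outputs, prev = [], torch.zeros_like(frames[0])  # start from a zero frame
    state = None
    for t in range(frames.size(0)):
        pred, state = decoder_cell(prev, state)  # one autoregressive step
        outputs.append(pred)
        # With probability sampling_prob, feed back the model's own
        # prediction instead of the ground-truth frame at step t.
        use_pred = random.random() < sampling_prob
        prev = pred.detach() if use_pred else frames[t]
    return torch.stack(outputs)  # [T, B, n_mel] predicted frames
```
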