ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019
DOI: 10.1109/icassp.2019.8683827

Phonemic-level Duration Control Using Attention Alignment for Natural Speech Synthesis

Cited by 15 publications (10 citation statements)
References 10 publications
“…Instead of providing the Mel spectrogram of the reference audio as input to the reference encoder or variational framework, as is the case for all the systems mentioned above, specific prosodic features extracted from the reference audio, such as F0, duration and loudness, can be used as input to prosody embedding networks. These prosodic features and their statistics can be extracted at utterance-level [11,12,13] or at frame-level and phoneme-level [14,15] to achieve utterance-level or fine-grained prosody control, respectively. A semi-supervised approach utilizing both Mel spectrograms and prosodic features as inputs to a variational framework is proposed in [16].…”
Section: Related Work (mentioning)
confidence: 99%
“…Explicit control over prosodic features such as F0 and duration by extracting these features and using them as input to a prosody encoder has been implemented for utterance-level [23,11,22] and more fine-grained [31,20,15,35] control.…”
Section: Related Work (mentioning)
confidence: 99%
“…Speech varies in expressions; however, these models only focus on the generation of narrative-style speech. Therefore, many researches have been recently proposed to control the prosody and speaking speed of the synthesized speech in a TTS system [5][6][7][8][9][10]. This paper focuses on the control of speaking speed that is essential for real scenario because the speaking speed must vary depending on the context or situation.…”
Section: Introduction (mentioning)
confidence: 99%
“…In [9,10], neural TTS systems that control the phoneme-level speech duration have been proposed. Phoneme duration is additionally inputted to the TTS system [9], or the hidden states of the phoneme sequence are expanded, corresponding to the phoneme duration [10].…”
Section: Introduction (mentioning)
confidence: 99%
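
The second mechanism in the statement above, expanding phoneme hidden states according to phoneme duration, can be illustrated with a minimal sketch. This is not the implementation of any cited system: the function names `expand_by_duration` and `scale_durations`, the plain-list representation of hidden states, and the convention that durations are integer frame counts are all assumptions made for illustration.

```python
from typing import List, Sequence


def expand_by_duration(
    hidden_states: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each phoneme's hidden state once per output frame.

    hidden_states: one vector per phoneme (hypothetical representation).
    durations: assumed integer frame count for each phoneme.
    """
    if len(hidden_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[List[float]] = []
    for state, n_frames in zip(hidden_states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so the output frames do not alias each other.
        frames.extend(list(state) for _ in range(n_frames))
    return frames


def scale_durations(durations: Sequence[int], rate: float) -> List[int]:
    """Hypothetical speaking-rate control: rate > 1 shortens every phoneme."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    # Keep at least one frame per phoneme so no phoneme disappears.
    return [max(1, round(d / rate)) for d in durations]
```

Under these assumptions, changing the per-phoneme durations before expansion changes the length of the frame-level sequence, which is how the speaking speed discussed in the citation statements would be controlled.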