2004
DOI: 10.1007/978-3-540-24630-5_38
Two-Level Alignment by Words and Phrases Based on Syntactic Information

Cited by 3 publications (6 citation statements)
References 9 publications
“…For example, autoregressive (AR) TTS models [1,2,3] find alignments by themselves using attention mechanisms. On the contrary, non-autoregressive (NAR) TTS family [4,5,6] uses external alignment search algorithms [7,8] and phoneme-wise duration predictors for length regulation. As a sentence can be spoken in various ways, representing and controlling the diversity of speech are also crucial issues for TTS.…”
Section: Introduction
confidence: 99%
“…For practical applicability, we extend Fre-Painter to a two-stage TTS system. In a conventional two-stage TTS system, an acoustic model generates a Mel-spectrogram as an intermediate representation [68], [69], and a neural vocoder then synthesizes an audio waveform from the Mel-spectrogram. Additionally, if audio super-resolution is performed using models that take an audio waveform as input, a total of three stages are involved.…”
Section: F Text-to-Speech Synthesis With Audio Super-Resolution
confidence: 99%
“…Within this approach, text embeddings are duplicated according to their pre-determined durations to align with speech frames. For training, ground truth durations are obtained from the pairs of text and speech using external monotonic alignment algorithms [7], [8]. During inference, when ground truth durations are inaccessible, an explicit duration predictor infers durations from text representations instead.…”
Section: B Alignment Modeling in Neural TTS
confidence: 99%
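The duplication step described in the statement above (a length regulator, as used in duration-based NAR TTS models) can be sketched as follows. This is a minimal illustration, not code from any of the cited papers; the function name `length_regulate` and the use of NumPy are assumptions for the example.

```python
import numpy as np

def length_regulate(text_emb: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Duplicate each text embedding by its predicted duration (in frames).

    text_emb:  (T, D) array, one embedding per phoneme
    durations: (T,) integer array, frames assigned to each phoneme
    Returns a (sum(durations), D) frame-level sequence.
    """
    # np.repeat with axis=0 repeats row i durations[i] times,
    # expanding the phoneme sequence to speech-frame length.
    return np.repeat(text_emb, durations, axis=0)

# Example: 3 phonemes with embedding dim 2, durations 2 + 1 + 3 = 6 frames
emb = np.arange(6, dtype=float).reshape(3, 2)
dur = np.array([2, 1, 3])
frames = length_regulate(emb, dur)
print(frames.shape)  # → (6, 2)
```

During training the durations come from an external alignment search; at inference they are produced by the duration predictor mentioned in the excerpt.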
“…Attention-based AR models, like those proposed in [1]- [3], operate using an AR model that predicts speech in a frame-by-frame manner and utilizes an attention mechanism to establish alignment. In contrast, duration-based NAR models, such as [4]- [6], require phoneme-wise duration to regulate speech frame length and generate frames in parallel, necessitating external alignment search algorithms [7], [8] and explicit duration predictors to obtain durations.…”
Section: Introduction
confidence: 99%