Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.564

Revisiting Over-Smoothness in Text to Speech

et al.

Abstract: Non-autoregressive text to speech (NAR-TTS) models have attracted much attention from both academia and industry due to their fast generation speed. One limitation of NAR-TTS models is that they ignore the correlation in time and frequency domains while generating speech mel-spectrograms, and thus cause blurry and over-smoothed results. In this work, we revisit this over-smoothing problem from a novel perspective: the degree of over-smoothness is determined by the gap between the complexity of data distributio…

Cited by 31 publications (15 citation statements)
References 27 publications
“…There have been multiple approaches for augmenting E2E models and training procedures to incorporate unpaired text data. Broadly speaking, these approaches use some combination of an LM trained on text data (shallow, cold, deep fusion [10,11,12,13]) and a multi-stage training procedure that incorporates unpaired data ("weak distillation" [14], "backtranslation" [15], "cycle-consistency" [16,17,18]). Each approach produces improvements in performance, but also increases some combination of model size, training and inference complexity, making it less desirable for on-device applications.…”
Section: Introduction (mentioning)
confidence: 99%
“…During inference stage, we take samples z from the prior distribution and feed them into the post-net reversely to generate the final mel-spectrogram. As proved in (Ren et al 2022), this flow-based module enhances the capability of modelling complex data distributions, which helps to address one-to-many mapping problem.…”
Section: Post-net (mentioning)
confidence: 92%
“…where Y_j denotes the j-th frame of the ground truth mel-spectrogram with length T_m, and Ŷ_j stands for the j-th frame of predicted mel-spectrogram. Note that the variance predictors simplify the acoustic target distribution by providing conditional information, thereby mitigating the one-to-many mapping issue (Ren et al 2022). We analyse the effect of variance information in our experiment section.…”
Section: Linguistic Predictor (mentioning)
confidence: 99%
“…However, there are still many challenges and opportunities in this domain [11], particularly when it comes to exploiting large amounts of data. On the speech generation side, one of the main difficulties is to build a model that correctly aligns the phonetic and acoustic sequences, leading to a natural prosody with fluent speech and high intelligibility, while still capturing the prosody variations [25]. On the opposite side, automatic speech recognition systems struggle with long-tail words recognition [35], and speech vs background disentanglement [18].…”
Section: Introduction (mentioning)
confidence: 99%