2020
DOI: 10.48550/arxiv.2010.15084
Preprint

Speech Synthesis and Control Using Differentiable DSP

Giorgio Fabbro, Vladimir Golkov, Thomas Kemp, et al.

Abstract: Modern text-to-speech systems are able to produce natural and high-quality speech, but speech contains factors of variation (e.g. pitch, rhythm, loudness, timbre) that text alone cannot contain. In this work we move towards a speech synthesis system that can produce diverse speech renditions of a text by allowing (but not requiring) explicit control over the various factors of variation. We propose a new neural vocoder that offers control of such factors of variation. This is achieved by employing differentiabl…

Cited by 2 publications (4 citation statements)
References 12 publications
“…In addition to the proposed taxonomy, we summarize the peer-reviewed publications following the aforementioned taxonomy in Table 1. Post-generation control Optimization in latent space Optimization in data space [31,130,[169][170][171][172] Peri-generation control Goal-oriented control [1,3,7,[173][174][175][176][177][178][179][180][181]] Distribution-guided control [10,182,183] Controllable transformation from source data Fixed transformation Within-domain transformation [171,[184][185][186][187][188][189][190][191]] Cross-domain transformation [185,186,188,192,193] Steerable transformation Control via reference data [37,] Control via latent space [11,[215][216][217][218][219][220][221][222]…”
Section: Taxonomy (mentioning)
confidence: 99%
“…EMOVIE achieves TTS transformation while controlling the emotion via the emotion embedding supervised by the emotion labels [229]. Fabbro et al [215] controls the TTS systems by decomposing the spectrogram into control variables, such as amplitude envelope, harmonic distribution, and filter coefficients. SCTKG generates essays from given topics while controlling sentiment for each sentence by injecting the sentiment information into the generator [224].…”
Section: Steerable Transformation (mentioning)
confidence: 99%
“…More recently, a complementary class of vocoder algorithms including source-filter models [20] and differentiable digital signal processing (DDSP) [3] explicitly model monophonic source production in a parametric manner, hearkening back to the core signal processing approaches of the past. Such techniques have been successfully extended to speech applications, either as standalone vocoders [10,4] or as part of an end-to-end system [14]. However, the main disadvantage of these approaches is that they lack a deterministic analysis procedure to accurately extract vocoder parameters directly from source audio.…”
Section: Introduction (mentioning)
confidence: 99%