2022
DOI: 10.3390/info13030102

A Bottleneck Auto-Encoder for F0 Transformations on Speech and Singing Voice

Abstract: In this publication, we present a deep learning-based method to transform the f0 in speech and singing voice recordings. f0 transformation is performed by training an auto-encoder on the voice signal’s mel-spectrogram and conditioning the auto-encoder on the f0. Inspired by AutoVC/F0, we apply an information bottleneck to it to disentangle the f0 from its latent code. The resulting model successfully applies the desired f0 to the input mel-spectrograms and adapts the speaker identity when necessary, e.g., if t…

citations
Cited by 5 publications
(7 citation statements)
references
References 54 publications
“…We can use a bottleneck auto-encoder [14] to disentangle the voice intensity from the mel-spectrogram of singing voice recordings. We extend the architecture of [16] to additionally include the voice intensity as conditional input.…”
Section: Proposed Intensity Transformations
Citation type: mentioning
confidence: 99%
“…A detailed description of this bottleneck auto-encoder is given in [16] so we only give a brief outline here: The autoencoder consists of a pair of networks, an encoder and a decoder which are cascaded (the input of the decoder is the output of the encoder). Additionally the decoder receives a conditional input, which in this case consists of the f0, voiced-unvoiced mask and the intensity from Section 2.…”
Section: Proposed Intensity Transformations
Citation type: mentioning
confidence: 99%
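
The cascade described in the citation statement above (an encoder feeding a decoder that also receives a conditional input) can be illustrated with a minimal sketch. This is not the authors' model: the single linear layers, the random weights standing in for trained parameters, the sizes `N_MELS` and `BOTTLENECK`, and the normalised condition values are all assumptions chosen for illustration. The only point it shows is that the decoder sees a narrow latent code plus the condition (f0, voiced-unvoiced mask, intensity), so changing the f0 value at decoding time changes the output while the code stays fixed.

```python
import random

N_MELS = 80      # mel bands per frame (assumed size)
BOTTLENECK = 4   # latent code size, far smaller than N_MELS (assumed size)
N_COND = 3       # f0, voiced-unvoiced mask, intensity


def make_matrix(rows, cols, rng):
    """Random weight matrix; a stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def encode(frame, w_enc):
    """Encoder: compress one mel frame into a narrow latent code."""
    return matvec(w_enc, frame)


def decode(code, condition, w_dec):
    """Decoder: rebuild a mel frame from the code plus the conditional input."""
    return matvec(w_dec, code + condition)


rng = random.Random(0)
w_enc = make_matrix(BOTTLENECK, N_MELS, rng)
w_dec = make_matrix(N_MELS, BOTTLENECK + N_COND, rng)

frame = [rng.random() for _ in range(N_MELS)]
code = encode(frame, w_enc)

# Same latent code, two different f0 values (normalised, hypothetical).
original = decode(code, [0.40, 1.0, 0.7], w_dec)
transposed = decode(code, [0.55, 1.0, 0.7], w_dec)
```

In a trained model the narrow code is what discourages the encoder from passing f0 through, which is why the decoder has to rely on the conditional input; the sketch only mirrors the data flow, not the training.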