Interspeech 2022
DOI: 10.21437/interspeech.2022-10740
Investigation into Target Speaking Rate Adaptation for Voice Conversion

Abstract: Disentangling speaker and content attributes of a speech signal into separate latent representations followed by decoding the content with an exchanged speaker representation is a popular approach for voice conversion, which can be trained with non-parallel and unlabeled speech data. However, previous approaches perform disentanglement only implicitly via some sort of information bottleneck or normalization, where it is usually hard to find a good trade-off between voice conversion and content reconstruction. …
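The conversion recipe the abstract describes — encode content and speaker separately, then decode the source content with the target speaker's embedding — can be illustrated with a toy sketch. This is a minimal illustration under stated assumptions, not the paper's model: the linear "encoders" and "decoder" (`content_encoder`, `speaker_encoder`, `decoder`) and all weight matrices are hypothetical stand-ins for learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_mels = 8, 16

# Hypothetical stand-in weights (a real system learns these networks).
W_c = rng.normal(size=(n_mels, d))
W_s = rng.normal(size=(n_mels, d))
W_d = rng.normal(size=(d, n_mels))

def content_encoder(mel):
    # Frame-wise content representation: (T, n_mels) -> (T, d).
    return mel @ W_c

def speaker_encoder(mel):
    # Utterance-level speaker embedding: (T, n_mels) -> (d,).
    return (mel @ W_s).mean(axis=0)

def decoder(content, spk):
    # Recombine content frames with a speaker embedding: (T, d) -> (T, n_mels).
    return (content + spk) @ W_d

src = rng.normal(size=(50, n_mels))  # source utterance (content donor)
tgt = rng.normal(size=(40, n_mels))  # target utterance (speaker donor)

# Voice conversion: decode the *source* content with the *target* speaker.
converted = decoder(content_encoder(src), speaker_encoder(tgt))
print(converted.shape)  # (50, 16): source frame count, target identity
```

Note that the output keeps the source utterance's frame count — which is exactly why the paper's question of adapting the *speaking rate* to the target arises: a plain content/speaker swap leaves timing untouched.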

Cited by 5 publications (4 citation statements)
References 28 publications
“…The last experiment assesses naturalness, intelligibility and speaker similarity. We compare Urhythmic to three state-of-the-art unsupervised rhythm conversion systems: AutoPST [9], UnsupSeg [10], and DISSC [11]. We use the official pretrained models for each baseline.…”
Section: Methods
confidence: 99%
“…However, training these systems requires parallel speech or text transcriptions, which are costly and time-consuming to collect. Unsupervised methods such as AutoPST [9], UnsupSeg [10], and DISSC [11] lift this restriction by modeling rhythm without annotations or parallel data. However, there is still a gap in quality and prosody compared to natural speech.…”
Section: Introduction
confidence: 99%
“…Other methods only linearly alter the speaking rate (Kuhlmann et al., 2022), thus ignoring the change of rhythm for different content.…”
Section: Related Work
confidence: 99%
“…Traditional VC methods mainly focused on changing the timbre of a given speaker while leaving the speaking style unchanged (Stylianou et al., 1998; Kain and Macon, 1998; Nakashika et al., 2013; Chou et al., 2019; Huang et al., 2021). Recent methods propose to additionally convert speaking style (Qian et al., 2020; Chen and Duan, 2022; Qian et al., 2021; Kuhlmann et al., 2022). However, these mainly use only a single target utterance, which does not fully capture speaker prosody.…”
Section: Introduction
confidence: 99%