Interspeech 2021
DOI: 10.21437/interspeech.2021-1356
S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations

Citation Types: 0 supporting, 12 mentioning, 0 contrasting
Years Published: 2021, 2024

Cited by 37 publications (12 citation statements)
References 0 publications
“…One possible reason is that in the A2A VC setting, modern S3Rs still fail to disentangle content, such that the synthesizer preserves too much speaker information. Another reason may be that a jointly trained speaker encoder [10] is essential for S3R-based VC.…”
Section: Results on Different Tasks (mentioning)
confidence: 99%
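
The jointly trained speaker encoder this excerpt refers to can be illustrated with a minimal PyTorch sketch. All class and dimension names below are hypothetical, and the GRUs stand in for whatever encoder/decoder the real system uses; this is a sketch of the general idea, not S2VC's actual architecture. The point of contrast: content features come from a frozen self-supervised (S3R) model, while the speaker encoder receives gradients from the synthesis loss, so it learns exactly the speaker information the decoder needs.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of an S3R-based any-to-any VC model with a jointly
# trained speaker encoder (illustrative names, not S2VC's exact design).
class S3RVoiceConverter(nn.Module):
    def __init__(self, content_dim=256, spk_dim=128, mel_dim=80):
        super().__init__()
        # Trained jointly with the decoder, unlike the frozen S3R model.
        self.speaker_encoder = nn.GRU(mel_dim, spk_dim, batch_first=True)
        self.decoder = nn.GRU(content_dim + spk_dim, mel_dim, batch_first=True)

    def forward(self, content_feats, ref_mels):
        # content_feats: (B, T, content_dim), frozen S3R features of the source
        # ref_mels:      (B, T_ref, mel_dim), reference audio of the target speaker
        _, h = self.speaker_encoder(ref_mels)              # h: (1, B, spk_dim)
        spk = h[-1].unsqueeze(1).expand(-1, content_feats.size(1), -1)
        out, _ = self.decoder(torch.cat([content_feats, spk], dim=-1))
        return out                                         # predicted mel frames
```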
“…ASR+TTS [4] was the seq2seq+non-AR vocoder baseline in VCC2020. S2VC [10] is the SOTA system for A2A VC. The results are shown in Table 2.…”
Section: Comparing with Top Systems Using Subjective Evaluation (mentioning)
confidence: 99%
“…To understand how Retriever models the content, the style, and the bipartite graph between them, we visualize the VQ codes and the decoder link attention map for a 2.5s-long test utterance.

Method                           SV Accuracy   Naturalness MOS   Similarity MOS
(Chou & Lee, 2019)               46.5%         1.86±0.10         2.26±0.14
FragmentVC (Lin et al., 2021b)   89.5%         3.43±0.12         3.54±0.15
S2VC (Lin et al., 2021a)         96.8%         3.18±0.12         3.36±0.15
Retriever                        99.4%         3.44±0.13         3.84±0.14
…”
Section: Visualization of Content-Style Representation (mentioning)
confidence: 99%
“…For the tokenization module, we follow Lin et al. (2021a), using the s3prl toolkit to extract the CPC feature, and use a depth-wise convolutional layer with kernel size 15 after the CPC feature. The depth-wise convolutional layer is trainable, while the CPC model is fixed during training.…”
Section: F.1 Implementation Details (mentioning)
confidence: 99%
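
The tokenization module this excerpt describes can be sketched in PyTorch. The class name and the 256-dimensional feature size below are assumptions (check the upstream model's actual output size); the sketch shows only the trainable depth-wise convolution, while the frozen CPC extractor itself would be loaded separately, e.g. via the s3prl toolkit's upstream API.

```python
import torch
import torch.nn as nn

# Minimal sketch of the tokenization module described above (hypothetical
# names; feat_dim=256 is an assumed CPC feature size). The CPC extractor
# from s3prl stays frozen; only this depth-wise convolution is trained.
class Tokenizer(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Depth-wise conv: groups == channels, one filter per feature channel;
        # kernel size 15 as in the excerpt, padding 7 preserves sequence length.
        self.dw_conv = nn.Conv1d(
            feat_dim, feat_dim, kernel_size=15, padding=7, groups=feat_dim
        )

    def forward(self, cpc_feats: torch.Tensor) -> torch.Tensor:
        # cpc_feats: (batch, time, feat_dim) from the frozen CPC model
        x = cpc_feats.transpose(1, 2)   # Conv1d expects (batch, channels, time)
        x = self.dw_conv(x)
        return x.transpose(1, 2)        # back to (batch, time, feat_dim)

# Usage sketch: feats = frozen_cpc(wav)  # frozen upstream, loading not shown
#               tokens = Tokenizer()(feats)
```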