Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020
DOI: 10.21437/vcc_bc.2020-28

The Academia Sinica Systems of Voice Conversion for VCC2020

Abstract: This paper describes the Academia Sinica systems for the two tasks of Voice Conversion Challenge 2020, namely voice conversion within the same language (Task 1) and cross-lingual voice conversion (Task 2). For both tasks, we followed the cascaded ASR+TTS structure, using phonetic tokens as the TTS input instead of the text or characters. For Task 1, we used the international phonetic alphabet (IPA) as the input of the TTS model. For Task 2, we used unsupervised phonetic symbols extracted by the vector-quantize…
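As a structural illustration only, the cascaded recognition-synthesis design the abstract describes can be sketched as below. The function names are hypothetical placeholders, not the authors' implementation.

```python
# Hypothetical sketch of the cascaded ASR+TTS voice conversion structure
# described in the abstract; both stage functions are placeholders, not
# the authors' actual models.

def recognize_to_tokens(source_wav):
    """ASR stage: source speech -> speaker-independent phonetic tokens
    (IPA symbols in Task 1; unsupervised VQ symbols in Task 2)."""
    raise NotImplementedError("placeholder for a trained recognizer")

def synthesize_speech(tokens, target_speaker):
    """TTS stage: phonetic tokens + target speaker -> converted waveform."""
    raise NotImplementedError("placeholder for a trained TTS model and vocoder")

def convert(source_wav, target_speaker):
    tokens = recognize_to_tokens(source_wav)
    return synthesize_speech(tokens, target_speaker)
```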

Cited by 4 publications (3 citation statements)
References 18 publications
“…The typical usage of S3R models is to extract continuous [32]-[36], [42] features for downstream tasks. However, due to the lack of supervision, continuous S3Rs lack the ability to fully separate contents from other factors such as speaker identity, resulting in poor performance in the A2A setting [29].…”
Section: A Recognition-Synthesis Based Voice Conversion
Citation type: mentioning (confidence: 99%)
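The "typical usage" in this quote, extracting continuous self-supervised features, might look like the sketch below, using torchaudio's pretrained wav2vec 2.0 bundle as a stand-in S3R model. The cited works cover a range of S3R models, and the input file name is a placeholder.

```python
import torch
import torchaudio

# Extract continuous self-supervised (S3R) features, the "typical usage"
# described in the quote. wav2vec 2.0 is only a stand-in S3R model here;
# "speech.wav" is a placeholder input file.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("speech.wav")
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.no_grad():
    # extract_features returns one (batch, frames, dim) tensor per layer.
    features, _ = model.extract_features(waveform)

# Continuous framewise features from the last transformer layer; these carry
# content but, per the quote, also residual speaker information.
print(features[-1].shape)
```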
“…One way to provide sufficient disentanglement is through discretization, as shown in [40]. Certain S3R models such as VQVAE [17] or vq-wav2vec [41] are able to generate discrete outputs due to their architecture, and some have therefore proposed VC systems based on them [33], [42]. However, not all S3R models have such a discretization design.…”
Section: A Recognition-Synthesis Based Voice Conversion
Citation type: mentioning (confidence: 99%)
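A toy illustration of the discretization this quote refers to: each continuous frame is replaced by the index of its nearest codebook vector, as in VQ-VAE-style quantizers. The codebook size and feature dimension below are arbitrary; a real system learns the codebook during training.

```python
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Replace each continuous frame with the index of its nearest codebook
    vector, the basic discretization step in VQ-VAE-style quantizers.

    features: (frames, dim) continuous S3R outputs
    codebook: (num_codes, dim) code vectors (learned in a real system)
    returns:  (frames,) discrete token ids
    """
    dists = torch.cdist(features, codebook)  # (frames, num_codes) distances
    return dists.argmin(dim=1)

# Toy usage with arbitrary sizes; a real VQ-VAE or vq-wav2vec learns the
# codebook jointly with the encoder.
frames = torch.randn(100, 256)
codebook = torch.randn(320, 256)
token_ids = quantize(frames, codebook)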
“…VQW2V: Features such as BNF are derived from an ASR model, which requires supervision using labels; this increases the cost of building such a system, especially in low-resource or cross-lingual settings. Alternatively, several studies [17]-[19] adopted self-supervised representations that do not require any labels during training while still being speaker-independent and framewise. In this work, we adopt the vector-quantized wav2vec (VQW2V) [20].…”
Section: Intermediate Representation
Citation type: mentioning (confidence: 99%)
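Extracting framewise discrete units with a pretrained vq-wav2vec checkpoint follows roughly the pattern below in fairseq. The checkpoint path is a placeholder, and the exact API can differ between fairseq versions.

```python
import torch
import fairseq

# Load a pretrained vq-wav2vec checkpoint; the path is a placeholder, and
# this follows the usage pattern from the fairseq wav2vec examples, which
# may differ between fairseq versions.
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    ["/path/to/vq-wav2vec.pt"]
)
model = models[0].eval()

wav = torch.randn(1, 16000)  # dummy 1-second, 16 kHz waveform
with torch.no_grad():
    z = model.feature_extractor(wav)                  # continuous features
    _, idxs = model.vector_quantizer.forward_idx(z)   # discrete code indices

# idxs has shape (batch, frames, groups): framewise discrete tokens obtained
# without transcription labels, matching the quote's motivation for VQW2V.
print(idxs.shape)
```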