ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022
DOI: 10.1109/icassp43922.2022.9747239

Voice Filter: Few-Shot Text-to-Speech Speaker Adaptation Using Voice Conversion as a Post-Processing Module

Citations: cited by 10 publications (3 citation statements)
References: 20 publications
“…Each of these recordings are paired with a synthetically generated audio of a target speaker, to train the many-to-one voice conversion model. These syntheses have been generated using a similar technique of the one described in this paper [9]. As mentioned in the section 4.6 of the manuscript, the evaluation set is composed of 3500 utterances, which were held out from the training set.…”
Section: Downstream Application 2: Text-less Intelligibility Evaluati... (mentioning)
confidence: 99%
“…For objective evaluation, we first calculate the average Speaker Embedding Cosine Similarity (SECS) between the reference and measured audios by a speaker verification model [24] to estimate speaker similarity. Further, we compute Conditional Fréchet Speech Distance (CFSD) [14] between the generated speech and actual recording to measure signal quality. Besides, we also evaluate mean square error for pitch (MSE_P) and duration (MSE_D) to access prosody similarity.…”
Section: Evaluation Metrics (mentioning)
confidence: 99%
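
The statement above names two simple objective metrics: an average cosine similarity over speaker embeddings (SECS) and a mean squared error over pitch or duration values. The following is a minimal sketch of that arithmetic only, not the citing paper's implementation. It assumes the embeddings have already been extracted by some speaker verification model (not shown) and that the pitch or duration sequences are already aligned and of equal length; the statement does not say how either is done. All function names here are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def average_secs(
    reference: Sequence[Sequence[float]],
    generated: Sequence[Sequence[float]],
) -> float:
    """Mean cosine similarity over paired (reference, generated) embeddings."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("need the same non-zero number of embeddings on both sides")
    scores = [cosine_similarity(r, g) for r, g in zip(reference, generated)]
    return sum(scores) / len(scores)


def mean_squared_error(target: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean squared error between two equal-length value sequences.

    Assumption: the sequences (e.g. pitch or duration values) are already aligned.
    """
    if len(target) != len(predicted) or not target:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)
```

For example, `average_secs(ref_embeddings, gen_embeddings)` would give one similarity score per evaluation set, and `mean_squared_error(ref_pitch, gen_pitch)` one prosody error per utterance; how scores are aggregated across utterances is left open here.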
“…lated to speaker identity [10,11,12,13]. The other alternative approach is based on using a light voice conversion postprocessing module to baseline TTS model [14]. The third challenge is to reduce amount of speech required to add new speaker to existing TTS model.…”
Section: Introduction (mentioning)
confidence: 99%