ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022
DOI: 10.1109/icassp43922.2022.9746484

A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion

Abstract: Voice conversion aims to transform source speech into a different target voice. However, typical voice conversion systems do not account for rhythm, which is an important factor in the perception of speaker identity. To bridge this gap, we introduce Urhythmic, an unsupervised method for rhythm conversion that does not require parallel data or text transcriptions. Using self-supervised representations, we first divide source audio into segments approximating sonorants, obstruents, and silences. Then we model rhy…

Cited by 52 publications (11 citation statements)
References 42 publications
“…Initially, SSL models were primarily used in speech recognition [6], [7]. Subsequently, similar approach was successfully extended to another speech processing tasks such as language, emotion, speaker recognition [20], [21] and VC [22]. While the real-time scenarios are highly important use cases for ASR models, the streaming scenario is often challenging for such models since SSL pre-training procedure is performed on full-length files without streaming mode adaptation.…”
Section: B. SSL Models (mentioning)
confidence: 99%
“…Textless-NLP [34,35] and AudioLM [5] do not use text transcriptions or phoneme symbols in speech processing systems; they use discrete units constructed by self-supervised learning. Soft discrete unit is another approach for textless speech processing [57].…”
Section: Textless-NLP (mentioning)
confidence: 99%
“…
• Voice conversion: We measure conversion intelligibility following [46], [47], whereby we perform voice conversion and then apply a speech recognition system to the output and compute a character error rate (CER) and F1 classification score to the word spoken in the original utterance. Speaker similarity is measured as described in [46] whereby we find similarity scores between real/generated utterance pairs using a trained speaker classifier, and then compute an EER with real/generated scores assigned a label of 0 and real/real pair scores assigned a label of 1.
• Speech enhancement: Given a series of original clean and noisy utterances, and the models' denoised output, we compute standard measures of denoising performance: narrow-band perceptual evaluation of speech quality (PESQ) [48] and short term objective intelligibility (STOI) scores [49].…”
Section: Experimental Setup: Unseen Tasks, A. Evaluation Metrics (mentioning)
confidence: 99%
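
The two voice conversion metrics named in the statement above, CER and EER, can be illustrated with a minimal sketch. It is not the evaluation code of the cited or citing papers: the function names `edit_distance`, `cer`, and `eer` are hypothetical, the CER is assumed to be character-level Levenshtein distance divided by reference length, and the EER is approximated by a simple sweep over the observed scores rather than by interpolating a ROC curve. Labels follow the quoted convention (0 for real/generated pairs, 1 for real/real pairs).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance normalised by reference length."""
    if not reference:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def eer(scores, labels):
    """Approximate equal error rate (assumed threshold-sweep variant).

    labels: 1 for real/real pairs, 0 for real/generated pairs.
    A pair is accepted when its similarity score is >= the threshold.
    """
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one pair of each label")

    best_gap, best_eer = float("inf"), None
    for t in sorted(set(scores)) + [float("inf")]:
        far = sum(s >= t for s in negatives) / len(negatives)  # false accepts
        frr = sum(s < t for s in positives) / len(positives)   # false rejects
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


if __name__ == "__main__":
    print(cer("kitten", "sitting"))
    print(eer([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]))
```

Under this labelling, an EER near 0 would mean the speaker classifier separates generated from real utterances easily, while a value closer to 0.5 would suggest the generated speech is harder to tell apart from the target speaker. That reading is an interpretation of the quoted setup, not a result reported here.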