Interspeech 2021
DOI: 10.21437/interspeech.2021-208
A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion

Abstract: We propose a new paradigm for maintaining speaker identity in dysarthric voice conversion (DVC). The poor quality of dysarthric speech can be greatly improved by statistical VC, but as the normal speech utterances of a dysthria patient are nearly impossible to collect, previous work failed to recover the individuality of the patient. In light of this, we suggest a novel, two-stage approach for DVC, which is highly flexible in that no normal speech of the patient is required. First, a powerful parallel sequen…

Cited by 9 publications (7 citation statements)
References 33 publications
“…The goal in the first stage is to completely capture the characteristics of the dysarthric speech. Following [6], we adopted the VTN [1,7], a Transformer-based [8] seq2seq model tailored for VC. When a parallel corpus is available, seq2seq modeling is considered state-of-the-art due to its ability to convert the prosodic structures in speech, which is critical in N2D VC.…”
Section: Many-to-One Seq2seq Modeling
confidence: 99%
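The statement above refers to the attention mechanism at the core of the Transformer-based VTN, which aligns target frames to source frames and thereby lets the model restructure prosody. The paper does not give code; the following is a minimal, illustrative numpy sketch of scaled dot-product attention over toy "frame" matrices, not the authors' implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core Transformer operation: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # (Tq, Tk) alignment scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over source frames
    return weights @ V, weights

# Toy example: attend from 3 "target" frames over 4 "source" frames of
# 8-dim features, standing in for mel-spectrogram frames in seq2seq VC.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))   # decoder queries (target frames)
K = rng.standard_normal((4, 8))   # encoder keys (source frames)
V = rng.standard_normal((4, 8))   # encoder values
out, attn = scaled_dot_product_attention(Q, K, V)
```

Because each output frame is a softmax-weighted mixture of source frames, the source and target sequences need not have the same length, which is what allows a seq2seq model to change duration and prosodic structure.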
“…This technique is flexible in that the VC corpus and the pretraining TTS dataset can be completely different in terms of speaker and content, even when trained between normal and dysarthric speakers. In [6], it was shown that training using only 15 minutes of speech from each speaker can yield good results.…”
Section: Many-to-One Seq2seq Modeling
confidence: 99%
“…Previous studies of dysarthric VC (DVC) have largely consisted of parallel methods based on partial least squares (PLS) regression [5], Gaussian mixture models (GMM) [4], or deep neural networks (DNN) [6,7]. There are also some methods that incorporate non-parallel VC methods as part of a parallel VC system [8,9].…”
Section: Introduction
confidence: 99%
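The PLS- and GMM-based parallel methods cited above all learn a frame-wise mapping between time-aligned source and target features. As a minimal stand-in for those regressions (ordinary least squares rather than PLS, on synthetic data), the core idea can be sketched as:

```python
import numpy as np

# Toy parallel corpus: N time-aligned frame pairs of d-dim spectral features.
rng = np.random.default_rng(1)
N, d = 200, 10
X = rng.standard_normal((N, d))                        # source (dysarthric) frames
W_true = rng.standard_normal((d, d))                   # hidden ground-truth mapping
Y = X @ W_true + 0.01 * rng.standard_normal((N, d))    # target (typical) frames

# Fit a linear frame-wise conversion Y ≈ X W by least squares.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Convert an unseen source frame with the learned mapping.
x_new = rng.standard_normal((1, d))
y_pred = x_new @ W
```

Such frame-wise mappings preserve the source timing, which is exactly the limitation that motivates the seq2seq (VTN) approach: prosodic and durational structure cannot be converted frame by frame.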