2021
DOI: 10.1109/taslp.2021.3066047

Transfer Learning From Speech Synthesis to Voice Conversion With Non-Parallel Training Data

Abstract: We present a novel voice conversion (VC) framework by learning from a text-to-speech (TTS) synthesis system, that is called TTS-VC transfer learning or TTL-VC for short. We first develop a multi-speaker speech synthesis system with sequence-to-sequence encoder-decoder architecture, where the encoder extracts the linguistic representations of input text, while the decoder, conditioned on target speaker embedding, takes the context vectors and the attention recurrent network cell output to generate target acoust…

Cited by 43 publications (17 citation statements)
References 58 publications (76 reference statements)
“…The evaluations saw good results, although subjective, proved that classical methods such as only frequency warping has difficulty in competing with the likes of NN. Among contemporary research is [31] which proposes to utilize Text-to-Speech (TTS) as an approach to training a model via a Recurrent Neural Network (RNN) vocoder for transfer learning. Their results show improvement over simpler techniques, and a slight improvement over similar (TTS-utilizing) methods.…”
Section: Comparing Results To Related Modern Work (mentioning)
confidence: 99%
“…Studies also show that voice conversion benefits from the knowledge about linguistic content in the speech. For example, speaker voice conversion successfully leverages TTS [132,20,133] or ASR systems [134,135] that are phonetically informed and trained on large speech corpus.…”
Section: Leveraging TTS or ASR Systems (mentioning)
confidence: 99%
“…Furthermore, the scope of the research is also confined to LA attacks as synthetic speech production is becoming more accessible and capturing naturality. This is due to the fact that open source tools and datasets are available for researchers to explore leading to more versatile synthetic speech generators [5], [23], [34], [35].…”
Section: Related Work (mentioning)
confidence: 99%
“…Moreover, the countermeasures developed so far are less than a decade old and still have a scope of improvement in terms of reducing the False Acceptance ratios. Most of the research is based on specific type of attack [5], [6] while few others consider all the types of attack making them universal detectors [7], [8].…”
Section: Introduction (mentioning)
confidence: 99%