2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
DOI: 10.23919/apsipa.2018.8659543

Error Reduction Network for DBLSTM-based Voice Conversion

Abstract: So far, many deep learning approaches to voice conversion produce good-quality speech by using a large amount of training data. This paper presents a Deep Bidirectional Long Short-Term Memory (DBLSTM) based voice conversion framework that can work with a limited amount of training data. We propose to implement a DBLSTM-based average model that is trained with data from many speakers. Then, we propose to perform adaptation with a limited amount of target data. Last but not least, we propose an error red…
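The DBLSTM in the abstract is a stack of bidirectional LSTM layers mapping source spectral frames to target frames. Below is a minimal NumPy sketch of two stacked bidirectional LSTM layers, to make the architecture concrete. All names, dimensions, and initializations here are illustrative assumptions, not taken from the paper; the output projection, training loop, speaker adaptation, and error reduction network are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Single LSTM cell with standard input/forget/output/candidate gating."""
    def __init__(self, in_dim, hid_dim, rng):
        scale = 0.1
        self.W = rng.standard_normal((4 * hid_dim, in_dim)) * scale   # input weights
        self.U = rng.standard_normal((4 * hid_dim, hid_dim)) * scale  # recurrent weights
        self.b = np.zeros(4 * hid_dim)
        self.hid_dim = hid_dim

    def step(self, x, h, c):
        z = self.W @ x + self.U @ h + self.b
        H = self.hid_dim
        i = sigmoid(z[:H])          # input gate
        f = sigmoid(z[H:2*H])       # forget gate
        o = sigmoid(z[2*H:3*H])     # output gate
        g = np.tanh(z[3*H:])        # candidate cell state
        c_new = f * c + i * g
        return o * np.tanh(c_new), c_new

def run_direction(cell, seq):
    """Run one LSTM cell over a (T, dim) sequence; return (T, hid) states."""
    h = np.zeros(cell.hid_dim)
    c = np.zeros(cell.hid_dim)
    outs = []
    for x in seq:
        h, c = cell.step(x, h, c)
        outs.append(h)
    return np.stack(outs)

def bidirectional_layer(fwd, bwd, seq):
    """Concatenate forward-time and backward-time hidden states per frame."""
    hf = run_direction(fwd, seq)
    hb = run_direction(bwd, seq[::-1])[::-1]  # reverse back to original order
    return np.concatenate([hf, hb], axis=1)

rng = np.random.default_rng(0)
T, feat_dim, hid = 5, 8, 16
# Two stacked bidirectional layers: a toy "deep" BLSTM.
layer1 = (LSTMCell(feat_dim, hid, rng), LSTMCell(feat_dim, hid, rng))
layer2 = (LSTMCell(2 * hid, hid, rng), LSTMCell(2 * hid, hid, rng))

frames = rng.standard_normal((T, feat_dim))  # stand-in for source spectral frames
h1 = bidirectional_layer(*layer1, frames)    # (T, 2*hid)
h2 = bidirectional_layer(*layer2, h1)        # (T, 2*hid)
print(h2.shape)  # (5, 32)
```

Because each frame's output depends on both past and future context, a bidirectional stack like this suits offline conversion of whole utterances, which is the setting the paper targets.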

Cited by 15 publications (4 citation statements); References 37 publications.
“…Liu et al. propose to use PPG for emotional voice conversion [227]. Zhang et al. also show that the average model framework can benefit from a small amount of parallel training data by using an error reduction network [228]. 4) Disentangling speaker from linguistic content: In the context of voice conversion, speech can be considered a composition of speaker voice identity and linguistic content.…”
Section: Non-parallel Data of Paired Speakers
confidence: 99%
“…In a recent VC review paper [48], it was shown that a sufficient amount of effort has been dedicated to transferring knowledge from ASR and TTS to improve various aspects of VC, regardless of whether a seq2seq model is used. The PPG-based methods [49]-[53] and the Parrotron system described in Section II-A facilitated nonparallel, any-to-one VC by utilizing ASR and TTS modules, respectively. Another line of work shows that training an integrated system capable of performing either TTS or VC can boost the individual performances [29], [54].…”
Section: Transfer Learning from ASR and TTS for VC
confidence: 99%
“…Liu et al. propose to use PPG for emotional voice conversion [214]. Zhang et al. also show that the average model framework can benefit from a small amount of parallel training data using an error reduction network [215].…”
Section: PPG Features
confidence: 99%