2019
DOI: 10.1109/access.2019.2912926
A BLSTM and WaveNet-Based Voice Conversion Method With Waveform Collapse Suppression by Post-Processing

Abstract: In recent years, neural network-based voice conversion methods have been rapidly developed, and many different models and neural networks have been applied in parallel voice conversion. However, the over-smoothing of parametric methods [e.g., bidirectional long short-term memory (BLSTM)] and the waveform collapse of neural vocoders (e.g., WaveNet) still have negative impacts on the quality of the converted voices. To overcome this problem, we propose a BLSTM and WaveNet-based voice conversion method cooperated…

Cited by 8 publications (4 citation statements)
References 23 publications
“…During the training phase, the input voice is first decomposed into acoustic features, such as fundamental frequency (F0), spectral envelope, and aperiodic components [14], and conversion functions are subsequently estimated to bridge the acoustic features obtained from the parallel corpus of the source speaker and the target speaker. During the conversion phase, the conversion function is applied on features extracted from the new input voice [10]. Finally, a converted speech waveform is generated from the converted acoustic features by implementing a vocoder [21].…”
Section: Related Work
confidence: 99%
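The three-phase pipeline quoted above (feature decomposition, conversion-function estimation on a parallel corpus, then waveform synthesis via a vocoder) can be sketched with a toy linear conversion function. This is a minimal illustration, not the paper's method: the random arrays stand in for time-aligned WORLD-style spectral features, and the per-dimension mean/variance mapping is a simple baseline conversion function assumed here for clarity.

```python
import numpy as np

# Toy stand-ins for time-aligned acoustic features (frames x dims),
# e.g. mel-cepstra derived from the spectral envelope.
rng = np.random.default_rng(0)
src_sp = rng.normal(loc=0.0, scale=1.0, size=(200, 25))  # source speaker
tgt_sp = rng.normal(loc=2.0, scale=0.5, size=(200, 25))  # target speaker

# "Training phase": estimate a per-dimension linear conversion function
#   y = (x - mu_src) / sd_src * sd_tgt + mu_tgt
mu_s, sd_s = src_sp.mean(0), src_sp.std(0)
mu_t, sd_t = tgt_sp.mean(0), tgt_sp.std(0)

def convert(x):
    """Map source-speaker features toward target-speaker statistics."""
    return (x - mu_s) / sd_s * sd_t + mu_t

# "Conversion phase": apply the function to features of a new utterance;
# a vocoder (WORLD, WaveNet, ...) would then synthesize the waveform.
new_sp = rng.normal(loc=0.0, scale=1.0, size=(50, 25))
conv_sp = convert(new_sp)
```

In a real system the linear map would be replaced by the learned model (e.g. a BLSTM), and the converted features would be passed to the vocoder for synthesis.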
“…BLSTM is an improvement of the bidirectional recurrent neural network (RNN), which can model a certain amount of contextual information with cyclic connections and map the whole history of previous inputs to each output in principle [10]. However, conventional RNNs can access only a limited range of context because of the gradient explosion or vanishing over time in long-range contextual transmission.…”
Section: BLSTM-Based VC
confidence: 99%
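The cyclic, bidirectional context modeling described above can be sketched as a small feature-mapping network. This is an illustrative sketch only: the layer sizes and feature dimension are assumptions, not the paper's configuration. The `bidirectional=True` flag gives each frame access to both past and future context, and the forward and backward hidden states are concatenated before projection.

```python
import torch
import torch.nn as nn

class BLSTMConverter(nn.Module):
    """Toy BLSTM mapping source acoustic features to target features."""

    def __init__(self, feat_dim=25, hidden=64):
        super().__init__()
        # bidirectional=True: each output frame sees past AND future inputs
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        # forward and backward states are concatenated -> 2 * hidden
        self.proj = nn.Linear(2 * hidden, feat_dim)

    def forward(self, x):            # x: (batch, frames, feat_dim)
        h, _ = self.blstm(x)
        return self.proj(h)

model = BLSTMConverter()
y = model(torch.randn(4, 100, 25))  # same shape as the input features
```

The gating inside each LSTM cell is what mitigates the vanishing/exploding-gradient issue of conventional RNNs mentioned in the statement.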
“…Although there has been an increasing number of speaker identification techniques based on Convolutional Neural Networks (CNN), Bi-directional Long Short-Term Memory networks (BLSTM) have rarely been used for this purpose, while they have provided good results in other audio applications, such as voice conversion [9], sound source separation [10] and speech recognition [11]. An important aspect of BLSTMs is their re-use of weights in their inner processes for modeling temporal data, which results in a small number of parameters.…”
Section: Introduction
confidence: 99%