2015
DOI: 10.1007/s11042-015-3039-x

High quality voice conversion using prosodic and high-resolution spectral features

Abstract: Voice conversion methods have advanced rapidly over the last decade. Studies have shown that speaker characteristics are captured by spectral feature as well as various prosodic features. Most existing conversion methods focus on the spectral feature as it directly represents the timbre characteristics, while some conversion methods have focused only on the prosodic feature represented by the fundamental frequency. In this paper, a comprehensive framework using deep neural networks to convert both timbre and p…

Cited by 20 publications (13 citation statements)
References 45 publications
“…The alignment is usually applied directly by Dynamic Time Warping (DTW) [17]. Also, there are techniques to get a more accurate feature alignment with the help of automatic speech recognition (ASR) techniques [18,14,19]. The aligned feature sequences x = x1, .., xT and y = y1, .., yT are then converted frame by frame in different methods (e.g.…”
Section: Parallel Data Voice Conversion; citation type: mentioning
confidence: 99%
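
The DTW alignment step mentioned in the statement above can be illustrated with a minimal sketch. This is not the implementation of the cited paper or of any citing paper; the function name `dtw_align`, the Euclidean frame distance, and the toy one-dimensional frames are assumptions chosen purely for illustration.

```python
from math import dist, inf


def dtw_align(source, target):
    """Align two sequences of feature vectors with dynamic time warping.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    matching source[i] to target[j]. Hypothetical helper for illustration.
    """
    n, m = len(source), len(target)
    # cost[i][j]: minimal accumulated distance aligning source[:i] with target[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])  # Euclidean frame distance (assumed)
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor cell.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Toy frames: the target holds the middle frame for one extra step.
source_frames = [(0.0,), (1.0,), (2.0,)]
target_frames = [(0.0,), (1.0,), (1.0,), (2.0,)]
total_cost, path = dtw_align(source_frames, target_frames)
aligned_x = [source_frames[i] for i, _ in path]
aligned_y = [target_frames[j] for _, j in path]
```

Indexing both sequences through the returned path yields two sequences of the same length T, which is the form a frame-by-frame conversion model would take as paired training data.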
“…Voice conversion, modifying the recorded speech of a source speaker toward a given target speaker, is a popular way to achieve such voice personalization. In this special issue, Nguyen et al [8] provide a comprehensive voice conversion framework using deep neural networks to convert both timbre and prosodic features. Experiments show that the use of prosodic and high-resolution spectral features leads to high-quality converted speech.…”
Section: The Importance Of Speech; citation type: mentioning
confidence: 99%
“…VC may be categorized into two categories: one with parallel corpus (i.e., both source and target speakers utter the same sentences) and the other with non-parallel corpus (i.e., the target speakers utter different sentences from the source speaker) tasks according to whether the source and target speakers speak the same texts. Many existing approaches that have yielded conversion results with both high quality and high similarity are based on parallel data, such as Gaussian mixture models (GMM) [2,11], frequency warping (FW) [12][13][14], deep neural networks (DNN) [15][16][17], non-negative matrix factorization (NMF) [18,19], and so on. This paper also focuses on VC with parallel training data.…”
Section: Introduction; citation type: mentioning
confidence: 99%