2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2011
DOI: 10.1109/icassp.2011.5947510

One sentence voice adaptation using GMM-based frequency-warping and shift with a sub-band basis spectrum model

Abstract: This paper presents a rapid voice adaptation algorithm using GMM-based frequency warping and shift with parameters of a sub-band basis spectrum model (SBM) [1]. The SBM parameter represents the shape of a speech spectrum and is calculated by fitting a sub-band basis to the log-spectrum. Because the parameter is a frequency-domain representation, frequency warping can be applied to it directly. A frequency warping function that minimizes the distance between source and target SBM parameter pairs in…
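The two steps the abstract names, fitting a basis to a log-spectrum and then warping the fitted parameters along the frequency axis, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the triangular basis with uniformly spaced centres, the least-squares fit, the single linear warping factor `alpha`, the treatment of `shift` as a constant offset on the log-domain parameters, and all function names. The GMM-based selection of warping functions is not modelled at all.

```python
import numpy as np

def hat_basis(n_bins, n_bands):
    """Triangular basis functions on a normalised frequency axis.

    Uniform centre spacing is an assumption; the paper's sub-band
    layout is not given in the abstract.
    """
    freqs = np.linspace(0.0, 1.0, n_bins)
    centres = np.linspace(0.0, 1.0, n_bands)
    width = centres[1] - centres[0]
    # Column k is a hat function that peaks at centres[k].
    dist = np.abs(freqs[:, None] - centres[None, :])
    basis = np.clip(1.0 - dist / width, 0.0, None)
    return freqs, centres, basis

def fit_params(log_spectrum, basis):
    """Least-squares fit of the basis weights to a log-spectrum."""
    params, *_ = np.linalg.lstsq(basis, log_spectrum, rcond=None)
    return params

def warp_and_shift(params, centres, alpha, shift):
    """Resample the parameters on a linearly warped axis, then add an offset.

    Output band k takes the value the input had at frequency
    alpha * centres[k]; both alpha and shift are hypothetical controls.
    """
    warped_axis = np.clip(alpha * centres, 0.0, 1.0)
    return np.interp(warped_axis, centres, params) + shift
```

Because the fitted parameters are indexed by frequency, the warp reduces to a one-dimensional resampling with `np.interp`, which is the property the abstract points to when it says warping can be applied to the parameter directly.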

Cited by 7 publications (2 citation statements)
References 6 publications
“…Sekii et al [26] sees good MOS results for their DNN approach, but execution time is not ideal for real-time use as it requires time to extract, convert and restore features. Kotani et al [27] while using larger data and time spent training sees typical VA performance and [28] with a route of frequency warping by considering many warping factors achieves decent results compared to the then state-of-the-art statistical methods. Many speaker data sets, layers in NN and long training times across various methods all fall towards the average mark, while [29] (DNN) see mixed results regarding speaker likeness in relation to training volume.…”
Section: Comparing Results To Related Modern Work (mentioning)
confidence: 99%
“…Such functions are normally trained from constant-dimension acoustic feature vectors provided by a vocoder. In many other solutions, the conversion function can be applied only to a specific speech signal representation: modification of formant frequencies and bandwidths [17,18], frequency warping (FW) [19,20,21], FW followed by amplitude scaling (AS) [22,23], etc. When footprint is not an issue, the system can keep some training data from the target speaker and then perform VC through frame selection [24], feature trajectory selection [25], or exemplar-driven transformations [26].…”
Section: Introduction (mentioning)
confidence: 99%