Voice Conversion Using Dynamic Frequency Warping With Amplitude Scaling, for Parallel or Nonparallel Corpora

Godoy, Elizabeth; Rosec, Olivier; Chonavel, Thierry

doi:10.1109/tasl.2011.2177820

Cited by 102 publications

(74 citation statements)

References 22 publications

“…In particular, the ML-GMM method is a well-established baseline method in the voice conversion research. In the frequency warping methods, the weighted frequency warping with amplitude scaling (WFW-AS) has been reported to achieve comparable performance to ML-GMM in terms of speaker similarity [39]. Hence, ML-GMM, DKPLS, and WFW-AS could be good choices to simulate voice conversion spoofing attacks when the training data are limited, although not all of them have been applied to spoofing attacks.…”

Section: Discussion (mentioning)

confidence: 99%

Voice conversion versus speaker verification: an overview

2014

SIP

A speaker verification system automatically accepts or rejects a claimed identity of a speaker based on a speech sample. Recently, major progress has been made in speaker verification, leading to mass-market adoption, such as in smartphones and in online commerce for user authentication. A major concern when deploying speaker verification technology is whether a system is robust against spoofing attacks. Speaker verification studies have provided good insight into speaker characterization, which has contributed to the progress of voice conversion technology. Unfortunately, voice conversion has become one of the most easily accessible techniques for carrying out spoofing attacks and therefore presents a threat to speaker verification systems. In this paper, we briefly introduce the fundamentals of voice conversion and speaker verification technologies. We then give an overview of recent spoofing attack studies under different conditions, with a focus on voice conversion spoofing attacks. We also discuss anti-spoofing measures for speaker verification.

“…As alternatives to data-driven statistical conversion methods, frequency warping based approaches to voice conversion were introduced in (Toda et al, 2001;Sundermann and Ney, 2003;Erro et al, 2010;Godoy et al, 2012;Erro et al, 2013). Rather than directly substituting the spectral characteristics of the input speech signal, these techniques effectively warp the frequency axis of a source spectrum to match that of the target.…”

Section: Voice Conversion (mentioning)

confidence: 99%

“…Frequency warping approaches tend to retain spectral details and produce high quality converted speech. A so-called Gaussian-dependent filtering approach to voice conversion introduced in (Matrouf et al, 2006;Bonastre et al, 2007) is related to amplitude scaling (Godoy et al, 2012) within a frequency warping framework.…”

Section: Voice Conversion (mentioning)

confidence: 99%

Spoofing and countermeasures for speaker verification: A survey

Evans, Kinnunen, et al.

2015

Speech Communication

While biometric authentication has advanced significantly in recent years, evidence shows the technology can be susceptible to malicious spoofing attacks. The research community has responded with dedicated countermeasures which aim to detect and deflect such attacks. Even if the literature shows that they can be effective, the problem is far from being solved; biometric systems remain vulnerable to spoofing. Despite a growing momentum to develop spoofing countermeasures for automatic speaker verification, now that the technology has matured sufficiently to support mass deployment in an array of diverse applications, greater effort will be needed in the future to ensure adequate protection against spoofing. This article provides a survey of past work and identifies priority research directions for the future. We summarise previous studies involving impersonation, replay, speech synthesis and voice conversion spoofing attacks and more recent efforts to develop dedicated countermeasures. The survey shows that future research should address the lack of standard datasets and the over-fitting of existing countermeasures to specific, known spoofing attacks.

“…We conducted subjective quality evaluations in a format similar to multi-stimulus test with hidden reference and anchor (MUSHRA) [32]. The listeners were presented with four test signals: (a) a hidden reference-the target speaker, (b) enhanced JGMM, (c) CGMM, and (d) En-GB.…”

Section: Subjective Evaluations (mentioning)

confidence: 99%

Grid-based approximation for voice conversion in low resource environments

Benisty, Malah, Crammer

2016

J AUDIO SPEECH MUSIC PROC.

The goal of voice conversion is to modify a source speaker's speech to sound as if spoken by a target speaker. Common conversion methods are based on Gaussian mixture modeling (GMM). They aim to statistically model the spectral structure of the source and target signals and require relatively large training sets (typically dozens of sentences) to avoid over-fitting. Moreover, they often lead to muffled synthesized output signals, due to excessive smoothing of the spectral envelopes. Mobile applications are characterized by low resources in terms of training data, memory footprint, and computational complexity. As technology advances, computational and memory requirements become less limiting; however, the amount of available training data still presents a great challenge, as a typical mobile user is willing to record himself saying just a few sentences. In this paper, we propose the grid-based (GB) conversion method for such low resource environments, which is successfully trained using very few sentences (5-10). The GB approach is based on sequential Bayesian tracking, by which the conversion process is expressed as a sequential estimation problem of tracking the target spectrum based on the observed source spectrum. The converted Mel frequency cepstrum coefficient (MFCC) vectors are sequentially evaluated using a weighted sum of the target training vectors used as grid points. The training process includes simple computations of Euclidean distances between the training vectors and is easily performed even in cases of very small training sets. We use global variance (GV) enhancement to improve the perceived quality of the synthesized signals obtained by the proposed and the GMM-based methods. Using just 10 training sentences, our enhanced GB method leads to converted sentences having closer GV values to those of the target and to lower spectral distances at the same time, compared to the enhanced version of the GMM-based conversion method. Furthermore, subjective evaluations show that signals produced by the enhanced GB method are perceived as more similar to the target speaker than the enhanced GMM signals, at the expense of a small degradation in the perceived quality.
