Small-parallel exemplar-based voice conversion in noisy environments using affine non-negative matrix factorization

Aihara, Ryo; Fujii, Takao; Nakashika, Toru; Takiguchi, Tetsuya; Ariki, Yasuo

doi:10.1186/s13636-015-0075-4

Cited by 8 publications

(6 citation statements)

References 17 publications

“…In the non-parallel setting, the initialization is based on the NMF and NTD frameworks. This initialization method uses an adaptive matrix [42]. Finally, initialized parameters are optimized by Eqs.…”

Section: Conditions (mentioning)

confidence: 99%

Non-parallel dictionary learning for voice conversion using non-negative Tucker decomposition

Takashima

Nakashika

Takiguchi

et al. 2019

J AUDIO SPEECH MUSIC PROC.

Self Cite

Voice conversion (VC) is a technique of exclusively converting speaker-specific information in the source speech while preserving the associated phonemic information. Non-negative matrix factorization (NMF)-based VC has been widely researched because of the natural-sounding voice it achieves when compared with conventional Gaussian mixture model-based VC. In conventional NMF-VC, models are trained using parallel data which results in the speech data requiring elaborate pre-processing to generate parallel data. NMF-VC also tends to be an extensive model as this method has several parallel exemplars for the dictionary matrix, leading to a high computational cost. In this study, an innovative parallel dictionary-learning method using non-negative Tucker decomposition (NTD) is proposed. The proposed method uses tensor decomposition and decomposes an input observation into a set of mode matrices and one core tensor. The proposed NTD-based dictionary-learning method estimates the dictionary matrix for NMF-VC without using parallel data. The experimental results show that the proposed method outperforms other methods in both parallel and non-parallel settings.

“…Non-negative matrix factorization (NMF) [9,31,32] assumes that the speech can be expressed with exemplars and corresponding weights. NMF builds a dictionary consisting of corresponding exemplars from source speech and target speech.…”

Section: Introduction (mentioning)

confidence: 99%

Noise-robust voice conversion with domain adversarial training

Du

Xie

Li

2022

Preprint

Voice conversion has made great progress in the past few years under the studio-quality test scenario in terms of speech quality and speaker similarity. However, in real applications, test speech from source speaker or target speaker can be corrupted by various environment noises, which seriously degrade the speech quality and speaker similarity. In this paper, we propose a novel encoder-decoder based noise-robust voice conversion framework, which consists of a speaker encoder, a content encoder, a decoder, and two domain adversarial neural networks. Specifically, we integrate disentangling speaker and content representation technique with domain adversarial training technique. Domain adversarial training makes speaker representations and content representations extracted by speaker encoder and content encoder from clean speech and noisy speech in the same space, respectively. In this way, the learned speaker and content representations are noise-invariant. Therefore, the two noise-invariant representations can be taken as input by the decoder to predict the clean converted spectrum. The experimental results demonstrate that our proposed method can synthesize clean converted speech under noisy test scenarios, where the source speech and target speech can be corrupted by seen or unseen noise types during the training process. Additionally, both speech quality and speaker similarity are improved.

“…Depending on whether there are same utterance pairs in the training dataset, voice conversion can be categorized into two types, a parallel one and a non-parallel one. The early studies [5,6,7,8,9] are focused on parallel voice conversion by building the spectrum mapping between the source and target speaker. Among them, the statistical parametric approaches like Gaussian mixture model (GMM) [5,6] and partial least square regression [7] use the statistical model to learn the mapping between the source and target spectrum.…”

Section: Introduction (mentioning)

confidence: 99%

“…However, these statistical parametric methods degrade the quality of the converted speeches due to the over-smoothing effects. Then the non-negative matrix factorization (NMF) based approaches [8,9] are proposed to address the over-smoothing effects by decomposing the spectrum into weighted linear combinations of exemplars.…”

Section: Introduction (mentioning)

confidence: 99%

Non-Parallel Many-To-Many Voice Conversion Using Local Linguistic Tokens

Wang

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

The VQ-VAE based voice conversion models have lately received increasing attention in non-parallel many to many voice conversion, where the encoder extracts the speaker-invariant linguistic content from the input speech using vector quantization and the decoder produces the target speech from the encoder output, conditioned on the target speaker representation. However, it is challenging for the encoder to find a proper balance between removing the speaker information and preserving the linguistic content, which degrades the converted speech quality. To address this issue, we propose the Local Linguistic Tokens (LLTs) model to learn high-quality speaker-invariant linguistic embeddings using the multi-head attention module, which has shown great success in extracting speaking style embeddings in Global Style Tokens (GSTs). Instead of vector quantization, the multi-head attention module makes the encoder preserve more linguistic content to enhance the converted speech quality. Both objective and subjective experimental results revealed that, compared with the state-of-the-art VQ-VAE model, the proposed LLTs model achieved significantly better speech quality and comparable speaker similarity. The converted samples are available online for listening.
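Several of the citation statements above describe exemplar-based NMF conversion as expressing a spectrum as a weighted linear combination of exemplars drawn from a dictionary of corresponding source and target exemplars. The sketch below illustrates that idea only; it is not the method of the cited paper or of any citing paper. It assumes a Euclidean-distance multiplicative update for the weights, random matrices standing in for real spectral exemplars, and hypothetical names (`estimate_activations`, `convert`, `A_src`, `A_tgt`).

```python
import numpy as np


def estimate_activations(X, A, n_iter=200, eps=1e-12):
    """Estimate non-negative weights H so that X ~= A @ H, with A held fixed.

    Uses the standard multiplicative update for the Euclidean NMF objective
    (an assumption made for this sketch).
    """
    rng = np.random.default_rng(0)
    H = rng.random((A.shape[1], X.shape[1])) + eps  # strictly positive start
    for _ in range(n_iter):
        # Multiplicative update keeps every entry of H non-negative.
        H *= (A.T @ X) / (A.T @ A @ H + eps)
    return H


def convert(X_src, A_src, A_tgt, n_iter=200):
    """Reuse the source-side weights with the target-side exemplars."""
    H = estimate_activations(X_src, A_src, n_iter)
    return A_tgt @ H, H


# Toy data: random non-negative matrices, NOT real speech features.
rng = np.random.default_rng(1)
n_bins, n_exemplars, n_frames = 16, 8, 5
A_src = rng.random((n_bins, n_exemplars))   # source exemplars (columns)
A_tgt = rng.random((n_bins, n_exemplars))   # paired target exemplars
H_true = rng.random((n_exemplars, n_frames))
X_src = A_src @ H_true                      # synthetic source "spectrogram"

Y, H = convert(X_src, A_src, A_tgt)
rel_err = np.linalg.norm(X_src - A_src @ H) / np.linalg.norm(X_src)
print("converted shape:", Y.shape)
print("relative reconstruction error:", rel_err)
```

Because column k of `A_src` and column k of `A_tgt` are assumed to be paired exemplars, weights estimated on the source side can be applied unchanged to the target dictionary; obtaining such pairs is what requires parallel data in the conventional setting the statements describe.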
