Interspeech 2019
DOI: 10.21437/interspeech.2019-3232

VQVAE Unsupervised Unit Discovery and Multi-Scale Code2Spec Inverter for Zerospeech Challenge 2019

Abstract: We describe our submitted system for the ZeroSpeech Challenge 2019. The current challenge theme addresses the difficulty of constructing a speech synthesizer without any text or phonetic labels and requires a system that can (1) discover subword units in an unsupervised way, and (2) synthesize the speech with a target speaker's voice. Moreover, the system should also balance the discrimination score ABX, the bit-rate compression rate, and the naturalness and the intelligibility of the constructed voice. To tac…

Cited by 57 publications (57 citation statements)
References 26 publications
“…Since the speech utterances for the sentences are unavailable, we generated sentences with Google text-to-speech API for all languages pairs. Even though the lack of natural speech dataset in this paper, VQ-VAE and codebook inverter can be applied and has shown a great performance on multispeaker natural speech [14,13]. Some papers [30,31,32] also show the performance improvement from the synthetic dataset can be carried over to the real dataset.…”
Section: Dataset (mentioning)
confidence: 97%
“…Machines can directly get discrete segments by applying such clustering algorithms as K-means [12], [11], GMM [11], or DPGMM clustering [13], [1], [14] from the acoustic features. The DPGMM algorithm [28] retained the state-of-the-art approach in the Zerospeech 2015 and 2017 [15], [16].…”
Section: B. DPGMM-RNN Model and Phoneme Categorization (mentioning)
confidence: 99%
“…However, low-dimensional continuous features are never as efficient as discrete features or discrete segments. The Vector Quantised-Variational AutoEncoder (VQ-VAE) can quantize speech acoustic features [11].…”
Section: Functional Load and Economical Principle (mentioning)
confidence: 99%
“…For the conventional Gaussian mixture model (GMM) approach, non-parallel VC can be adapted from a pretrained parallel VC in the model space using the maximum a posterior (MAP) method [7,8] or as interpolation between multiple parallel models [9]. For recent neural network approaches, a non-parallel VC can be trained by directly using an intermediate linguistic representation extracted from an automatic speech recognition (ASR) model [10,11] or by indirectly encouraging the network to disentangle linguistic information from the speaker characteristics using methods like variational autoencoder (VAE) [12], generative adversarial networks (GAN) [13,14] or some other techniques [15]. For both parallel and non-parallel VC, the systems usually change the voice but are unable to change the duration of the utterance.…”
Section: Introduction (mentioning)
confidence: 99%