Text-Free Prosody-Aware Generative Spoken Language Modeling

Kharitonov, Eugene; Lee, Ann; Polyak, Adam; Adi, Yossi; Copet, Jade; Lakhotia, Kushal; Nguyen, Tu Anh; Rivière, Morgane; Mohamed, Abdelrahman; Dupoux, Emmanuel; Hsu, Wei-Ning

doi:10.18653/v1/2022.acl-long.593

Cited by 22 publications

(7 citation statements)

References 31 publications

“…In GSLM, a large autoregressive language model is typically trained on discovered discrete units (e.g. HuBERT [18] clusters or clustered spectrogram features), similar to how a language model is trained on text [19], [20]. While this also enables the generation of speech without any conditioning input, GSLM implies a model structure consisting of an encoder to discretize speech, a language model, and a decoder to convert the discrete units back into a waveform [17].…”

Section: Related Work (mentioning)

confidence: 99%

GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models

Kamper

2023

2022 IEEE Spoken Language Technology Workshop (SLT)

Can we develop a model that can synthesize realistic speech directly from a latent space, without explicit conditioning? Despite several efforts over the last decade, previous adversarial and diffusion-based approaches still struggle to achieve this, even on small-vocabulary datasets. To address this, we propose AudioStyleGAN (ASGAN), a generative adversarial network for unconditional speech synthesis tailored to learn a disentangled latent space. Building upon the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer. To successfully train ASGAN, we introduce a number of new techniques, including a modification to adaptive discriminator augmentation which probabilistically skips discriminator updates. We apply it on the small-vocabulary Google Speech Commands digits dataset, where it achieves state-of-the-art results in unconditional speech synthesis. It is also substantially faster than existing top-performing diffusion models. We confirm that ASGAN's latent space is disentangled: we demonstrate how simple linear operations in the space can be used to perform several tasks unseen during training. Specifically, we perform evaluations in voice conversion, speech enhancement, speaker verification, and keyword classification. Our work indicates that GANs are still highly competitive in the unconditional speech synthesis landscape, and that disentangled latent spaces can be used to aid generalization to unseen tasks. Code, models, samples: https://github.com/RF5/simple-asgan/.

“…it is not possible to interpolate between two utterances in a latent space or to directly control speaker characteristics during generation. If this is desired, additional components must be explicitly built into the model [20].…”

Section: Related Work (mentioning)

confidence: 99%

GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models

Kamper

2023

2022 IEEE Spoken Language Technology Workshop (SLT)

“…The text is synthesized into speech. A spoken language model can be made to generate spoken language directly, as demonstrated by [67], [100], [126], [127]. Much as Task 3 is complementary to Task 1, but has slightly different constraints, the task of generating speech from a spoken language model is complementary to Task 4, yielding a potential Task 5.…”

Section: The Future Of the Zero Resource Speech Challenge (mentioning)

confidence: 99%

Self-Supervised Language Learning From Raw Audio: Lessons From the Zero Resource Speech Challenge

Dunbar, Hamilakis, Dupoux

2022

IEEE J. Sel. Top. Signal Process.

Self Cite

Recent progress in self-supervised or unsupervised machine learning has opened the possibility of building a full speech processing system from raw audio without using any textual representations or expert labels such as phonemes, dictionaries or parse trees. The contribution of the Zero Resource Speech Challenge series since 2015 has been to break down this long-term objective into four well-defined tasks (Acoustic Unit Discovery, Spoken Term Discovery, Discrete Resynthesis, and Spoken Language Modeling) and introduce associated metrics and benchmarks enabling model comparison and cumulative progress. We present an overview of the six editions of this challenge series since 2015, discuss the lessons learned, and outline the areas which need more work or give puzzling results.

“…Thus, they involve another phase of converting from the spectral domain to the time domain using a vocoder. Moreover, using discrete self-supervised speech representations and generating waveforms from these was demonstrated to provide superior performance on plenty of downstream tasks such as speech and audio language modelling (Borsos et al, 2022; Qian et al, 2022), multi-stream processing (Kharitonov et al, 2022b), speech emotion conversion (Kreuk et al, 2021), spoken dialogue (Nguyen et al, 2022), speech-to-speech translation (Lee et al, 2022a,b; Popuri et al, 2022), and audio generation (Kreuk et al, 2022a,b).…”

Section: Introduction (mentioning)

confidence: 99%

Speaking Style Conversion in the Waveform Domain Using Discrete Self-Supervised Units

Maimon, Adi

2023

Findings of the Association for Computational Linguistics: EMNLP 2023

We introduce DISSC, a novel, lightweight method that converts the rhythm, pitch contour and timbre of a recording to a target speaker in a textless manner. Unlike DISSC, most voice conversion (VC) methods focus primarily on timbre, and ignore people's unique speaking style (prosody). The proposed approach uses a pretrained, self-supervised model for encoding speech to discrete units, which makes it simple, effective, and fast to train. All conversion modules are only trained on reconstruction like tasks, thus suitable for any-to-many VC with no paired data. We introduce a suite of quantitative and qualitative evaluation metrics for this setup, and empirically demonstrate that DISSC significantly outperforms the evaluated baselines. Code and samples are available at https://pages.cs.huji.ac.il/adiyoss-lab/dissc/.
