Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.593
Text-Free Prosody-Aware Generative Spoken Language Modeling

Abstract: Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) is the only prior work addressing the generative aspects of speech pretraining, which replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences. Unfortunately, desp…

Cited by 22 publications (7 citation statements) · References 31 publications
“…In GSLM, a large autoregressive language model is typically trained on discovered discrete units (e.g. HuBERT [18] clusters or clustered spectrogram features), similar to how a language model is trained on text [19], [20]. While this also enables the generation of speech without any conditioning input, GSLM implies a model structure consisting of an encoder to discretize speech, a language model, and a decoder to convert the discrete units back into a waveform [17].…”
Section: Related Work
confidence: 99%
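The quoted passage describes the three-stage GSLM structure: an encoder that discretizes speech into pseudo-units, an autoregressive language model over those units, and a decoder that maps generated units back to a waveform. The sketch below illustrates that data flow only; all function names are hypothetical stand-ins, not the actual GSLM API (the real encoder is k-means over HuBERT features, the LM a Transformer, and the decoder a neural vocoder).

```python
# Illustrative sketch of the encoder -> unit LM -> decoder pipeline
# described in the quote above. Names and internals are stand-ins,
# not the actual GSLM implementation.
import numpy as np


def encode_to_units(waveform, codebook):
    """Discretize fixed-size frames by nearest-neighbour codebook lookup
    (a stand-in for k-means clustering of HuBERT features)."""
    frames = waveform.reshape(-1, codebook.shape[1])
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)


def lm_generate(prefix_units, n_new, n_units, rng):
    """Stand-in for an autoregressive unit language model: samples units
    uniformly at random; a real model would condition on the prefix."""
    return np.concatenate([prefix_units, rng.integers(0, n_units, n_new)])


def decode_to_waveform(units, codebook):
    """Stand-in for the unit-to-speech decoder (vocoder): maps each unit
    back to its codebook vector and flattens the result."""
    return codebook[units].reshape(-1)


rng = np.random.default_rng(0)
codebook = rng.normal(size=(100, 4))   # 100 discrete units, 4-dim frames
speech = rng.normal(size=32)           # 8 frames of 4 samples each
units = encode_to_units(speech, codebook)            # speech -> units
continuation = lm_generate(units, 5, 100, rng)       # units -> more units
audio = decode_to_waveform(continuation, codebook)   # units -> waveform
```

The point of the sketch is the interface, not the models: because the LM sees only discrete units, it can continue (or generate unconditionally) without any text, which is exactly the property the citing papers highlight.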
“…it is not possible to interpolate between two utterances in a latent space or to directly control speaker characteristics during generation. If this is desired, additional components must be explicitly built into the model [20].…”
Section: Related Work
confidence: 99%
“…The text is synthesized into speech. A spoken language model can be made to generate spoken language directly, as demonstrated by [67], [100], [126], [127]. Much as Task 3 is complementary to Task 1 (but has slightly different constraints), the task of generating speech from a spoken language model is complementary to Task 4, yielding a potential Task 5.…”
Section: The Future of the Zero Resource Speech Challenge
confidence: 99%
“…Thus, they involve another phase of converting from the spectral domain to the time domain using a vocoder. Moreover, using discrete self-supervised speech representations and generating waveforms from these has been shown to provide superior performance on a range of downstream tasks, such as speech and audio language modelling (Borsos et al., 2022; Qian et al., 2022), multi-stream processing (Kharitonov et al., 2022b), speech emotion conversion (Kreuk et al., 2021), spoken dialogue (Nguyen et al., 2022), speech-to-speech translation (Lee et al., 2022a,b; Popuri et al., 2022), and audio generation (Kreuk et al., 2022a,b).…”
Section: Introduction
confidence: 99%