Interspeech 2022
DOI: 10.21437/interspeech.2022-194

SpeechPainter: Text-conditioned Speech Inpainting

Abstract: We propose SpeechPainter, a model for filling in gaps of up to one second in speech samples by leveraging an auxiliary textual input. We demonstrate that the model performs speech inpainting with the appropriate content while maintaining speaker identity, prosody, and recording environment conditions, and that it generalizes to unseen speakers. Our approach significantly outperforms baselines constructed using adaptive TTS, as judged by human raters in side-by-side preference and MOS tests.


Cited by 13 publications (4 citation statements) | References 17 publications
“…Also worth mentioning are other recent works that have applied multi-modal side information as a conditioner for the inpainting algorithm, including video frames [36], symbolic music [37,19], or text [38,39]. Although this idea falls outside the scope of this paper, exploiting multi-modal information may turn out to be beneficial for inpainting large gaps, where the context of the gap does not contain enough information to reconstruct the missing segment.…”
Section: Deep-Learning-Based Audio Inpainting (mentioning)
confidence: 99%
“…While earlier approaches to musical audio generation were limited in terms of producing high-quality outputs (Dhariwal et al., 2020) or semantically consistent long audio (Hawthorne et al., 2022), recent research has achieved a level of quality that allows for an enjoyable listening experience. A first line of work casts the task of music generation as categorical prediction in the discrete token space provided by a neural audio codec (Zeghidour et al., 2022), and trains a Transformer-based (Vaswani et al., 2017) model for next-token prediction (Borsos et al., 2023a) or parallel token decoding (Borsos et al., 2023b; Garcia et al., 2023; Parker et al., 2024).…”
Section: Related Work (mentioning)
confidence: 99%
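The "parallel token decoding" mentioned in the statement above refers, in the cited works, to a MaskGIT-style scheme: all codec tokens start masked, the model scores every position in parallel, and only the most confident candidates are committed at each step. The following is a minimal sketch of that idea; `model`, `mask_id`, and the linear unmasking schedule are illustrative placeholders, not the papers' actual implementations.

```python
import torch

def parallel_decode(model, seq_len, mask_id, num_steps=8):
    """Sketch of confidence-based parallel decoding over discrete codec
    tokens. `model` stands in for any bidirectional network that maps a
    (1, seq_len) tensor of token ids to (1, seq_len, vocab) logits."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(num_steps):
        still_masked = tokens[0] == mask_id
        if not still_masked.any():
            break
        probs = model(tokens).softmax(dim=-1)                  # (1, seq_len, vocab)
        sampled = torch.multinomial(probs[0], 1).squeeze(-1)   # one candidate per position
        conf = probs[0, torch.arange(seq_len), sampled]        # confidence of each candidate
        # Never re-commit positions that are already filled.
        conf = torch.where(still_masked, conf, torch.full_like(conf, -1.0))
        # Commit a growing fraction of the remaining masked positions each step.
        n_commit = max(1, int(still_masked.sum().item() * (step + 1) / num_steps))
        committed = conf.topk(n_commit).indices
        tokens[0, committed] = sampled[committed]
    return tokens

# e.g. exercising the loop with a dummy "model" over a 1024-token vocabulary:
out = parallel_decode(lambda t: torch.randn(1, t.shape[1], 1024), seq_len=150, mask_id=1024)
```

Because every step scores all positions at once, the number of network calls is fixed by `num_steps` rather than by sequence length, which is the source of the speedup over autoregressive decoding that the cited works report.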
“…We address this with a hierarchical transformer, similarly to Lee et al. (2022); Yang et al. (2023). Finally, instead of the original autoregressive fine acoustic modelling stage of MusicLM, we use SoundStorm (Borsos et al., 2023b) to achieve efficient parallel generation.…”
Section: MusicLM (mentioning)
confidence: 99%
“…Bai et al. [9] propose an alignment-aware acoustic and text pretraining method, which can be directly applied to speech editing by reconstructing masked acoustic signals from text input and acoustic-text alignment. Moreover, SpeechPainter [10] leverages an auxiliary textual input to fill in gaps of up to one second in speech samples and generalizes to unseen speakers. However, when applied to speech editing, all the existing neural-network-based methods [6,7,8,9] perform partial inference rather than inference over the entire utterance, as shown in Figure 1(a).…”
Section: Introduction (mentioning)
confidence: 99%
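As context for the quoted description of SpeechPainter, a text-conditioned inpainting input can be assembled by hiding a segment of the acoustic features and pairing the result with the full transcript, which tells the model what content to synthesize inside the gap. The sketch below is illustrative only, assuming mel-spectrogram features; the function and field names are hypothetical and not taken from the paper's code.

```python
import numpy as np

def make_inpainting_example(features, transcript, gap_start, gap_len, mask_value=0.0):
    """Builds one text-conditioned speech-inpainting input: hides a span of
    acoustic frames and pairs the result with the full transcript."""
    masked = features.copy()
    masked[gap_start:gap_start + gap_len] = mask_value   # hide the frames to reconstruct
    gap_mask = np.zeros(len(features), dtype=bool)
    gap_mask[gap_start:gap_start + gap_len] = True       # marks the frames the model must fill
    return {"features": masked, "gap_mask": gap_mask, "text": transcript}

# e.g. a 1-second gap (100 frames at 100 frames/s) in an 80-bin mel spectrogram:
example = make_inpainting_example(np.random.randn(500, 80), "the quick brown fox", 200, 100)
```

The transcript covers the whole utterance, not just the gap, so the model can align the visible audio with the text and infer which words fall inside the masked region.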