2022
DOI: 10.1109/taslp.2021.3129994

SoundStream: An End-to-End Neural Audio Codec

Citations: cited by 209 publications (133 citation statements)
References: 40 publications
“…In our experiments, we use a neural vocoder to perform the audio synthesis from log-mel spectrograms. The architecture of the vocoder is identical to MelGAN [12], but it is trained with a multi-scale reconstruction loss as well as adversarial losses from both wave- and STFT-based discriminators, following [22].…”
Section: Model (citation type: mentioning)
confidence: 99%
“…At the same time, we optimize G to produce outputs that are indistinguishable from the ground truth by matching the feature representations in all layers of D [13, 12]. D is a convolutional network closely resembling the single-scale STFT discriminator of [22], shown in Figure 2 - for further details about the architecture, we refer the reader to [22]. The discriminator is trained with the hinge loss,…”
Section: Two-phase Training (citation type: mentioning)
confidence: 99%
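
The feature-matching idea in the statement above (training G so that the features D computes on generated audio match those it computes on real audio, in all layers) can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the function name `feature_matching_loss`, the flat per-layer lists of floats, and the choice of a mean absolute (L1) distance averaged over layers are all assumptions made for the example.

```python
from typing import List, Sequence


def _mean(values: Sequence[float]) -> float:
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def feature_matching_loss(
    real_features: List[List[float]],
    fake_features: List[List[float]],
) -> float:
    """Average L1 distance between discriminator features, layer by layer.

    real_features[l] and fake_features[l] hold the (flattened) activations of
    discriminator layer l for real and generated audio. The L1 distance and
    the plain average over layers are assumptions of this sketch.
    """
    per_layer = []
    for real, fake in zip(real_features, fake_features):
        # Mean absolute difference within one layer.
        per_layer.append(_mean([abs(r - f) for r, f in zip(real, fake)]))
    # Average over all layers of D.
    return _mean(per_layer)


# Hypothetical activations from a two-layer discriminator.
real = [[1.0, 2.0], [0.0]]
fake = [[1.5, 1.0], [2.0]]
loss = feature_matching_loss(real, fake)
```

The loss is zero only when the generated audio produces the same activations as the real audio in every layer, which is why it is used as a training signal for G rather than for D.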
“…(Agustsson et al., 2019; Mentzer et al., 2020). Neural compression has also been applied to video (Lu et al., 2019; Goliński et al., 2020; Agustsson et al., 2020) and audio (Kleijn et al., 2018; Valin & Skoglund, 2019; Yang et al., 2019; Zeghidour et al., 2021).…”
Section: Entropy Coding (citation type: mentioning)
confidence: 99%
“…The model architecture is based on the decoder described in [27,28], which is a real-time streaming-capable version of MelGAN [29]. Its structure with parameters is shown in Table 1.…”
Section: Neural Network Based Spectrogram Inversion (citation type: mentioning)
confidence: 99%
“…We train the neural vocoder with the same mix of losses used in [28] to achieve both signal reconstruction fidelity and perceptual quality, following the principles of the perception-distortion trade-off discussed in [31]. The adversarial loss is used to promote perceptual quality and it is defined as a hinge loss over the logits of the discriminator, averaged over multiple discriminators and over time, operating both in the time domain and in the STFT domain.…”
Section: Neural Network Based Spectrogram Inversion (citation type: mentioning)
confidence: 99%
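
The adversarial loss described in the statement above (a hinge loss over discriminator logits, averaged over time and over multiple discriminators) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `discriminator_hinge_loss` and `generator_hinge_loss` are hypothetical, each discriminator is assumed to return a plain list of logits over time, and the margin of 1 is the standard hinge formulation rather than a value taken from the quoted text.

```python
from typing import List, Sequence


def _mean(values: Sequence[float]) -> float:
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def discriminator_hinge_loss(
    real_logits: List[List[float]],
    fake_logits: List[List[float]],
) -> float:
    """Hinge loss for the discriminators.

    real_logits[k][t] is the logit of discriminator k at time step t on real
    audio; fake_logits[k][t] is the same for generated audio. Discriminators
    may have different numbers of time steps.
    """
    per_discriminator = []
    for real, fake in zip(real_logits, fake_logits):
        # Push real logits above +1 and generated logits below -1.
        loss_real = _mean([max(0.0, 1.0 - r) for r in real])
        loss_fake = _mean([max(0.0, 1.0 + f) for f in fake])
        per_discriminator.append(loss_real + loss_fake)
    # Average over discriminators (e.g. time-domain and STFT-domain ones).
    return _mean(per_discriminator)


def generator_hinge_loss(fake_logits: List[List[float]]) -> float:
    """Adversarial loss for the generator: raise logits on generated audio."""
    per_discriminator = [
        _mean([max(0.0, 1.0 - f) for f in fake]) for fake in fake_logits
    ]
    return _mean(per_discriminator)


# Hypothetical logits: two discriminators with different temporal resolution.
d_loss = discriminator_hinge_loss([[2.0, 0.0], [1.0]], [[-2.0, 0.0], [-1.0]])
g_loss = generator_hinge_loss([[1.0], [-1.0, 3.0]])
```

Averaging over time within each discriminator before averaging across discriminators keeps a discriminator with many output steps from dominating one with few.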