2021
DOI: 10.48550/arxiv.2104.09946
Preprint

A cappella: Audio-visual Singing Voice Separation

Abstract: Music source separation can be interpreted as the estimation of the constituent music sources that a music clip is composed of. In this work, we explore the single-channel singing voice separation problem from a multimodal perspective, by jointly learning from audio and visual modalities. To do so, we present Acappella, a dataset spanning around 46 hours of a cappella solo singing videos sourced from YouTube. We propose Y-Net, an audio-visual convolutional neural network which achieves state-of-the-art singing…
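For context, the single-channel separation problem described in the abstract is commonly formalised (standard notation, not taken from the paper itself) by modelling the mixture as a sum of sources and recovering the target voice through a time-frequency mask applied to the mixture spectrogram:

\[
x(t) = \sum_{i} s_i(t), \qquad \hat{S}_{\mathrm{voice}} = M \odot X, \quad M \in [0,1]^{F \times T},
\]

where $X$ is the magnitude spectrogram of the mixture $x(t)$, $M$ is the mask predicted by the network from the audio and visual inputs, and $\odot$ denotes element-wise multiplication.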

Cited by 2 publications (5 citation statements). References 34 publications.
“…AV-CVAE [64] is an audio-visual speech separation model based on a VAE [42]; it detects the speaker's lip movements and predicts the separated speech of each individual. Acappella [61] addresses audio-visual singing voice separation; its architecture is a two-stream CNN, called Y-Net, whose branches process audio and video respectively.…”
Section: Audio-visual Speech Separation (mentioning)
Confidence: 99%
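The Y-Net description above (a two-stream CNN over audio and video) can be illustrated with a minimal PyTorch sketch, assuming a spectrogram-masking formulation; the module names, layer sizes, and fusion scheme below are illustrative assumptions, not the authors' implementation.

```python
# A two-stream audio-visual separator in the spirit of the description above:
# an audio encoder over the mixture spectrogram, a video encoder over mouth
# crops, feature fusion, and a soft time-frequency mask applied to the mixture.
# All layer sizes and module names here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStreamSeparator(nn.Module):
    def __init__(self, vid_feat=128):
        super().__init__()
        # Audio stream: 2-D convolutions over the (frequency, time) spectrogram.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        # Video stream: 3-D convolutions over (time, height, width) mouth crops,
        # spatially pooled to one feature vector per video frame.
        self.video_enc = nn.Sequential(
            nn.Conv3d(3, 32, (3, 5, 5), padding=(1, 2, 2)), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),
        )
        self.video_proj = nn.Linear(32, vid_feat)
        # Decoder: fuse both streams and predict a mask in [0, 1].
        self.decoder = nn.Sequential(
            nn.Conv2d(64 + vid_feat, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1), nn.Sigmoid(),
        )

    def forward(self, mix_spec, video):
        # mix_spec: (B, 1, F, T_aud) magnitude spectrogram of the mixture
        # video:    (B, 3, T_vid, H, W) RGB mouth-region frames
        a = self.audio_enc(mix_spec)                             # (B, 64, F, T_aud)
        v = self.video_enc(video).flatten(2)                     # (B, 32, T_vid)
        v = self.video_proj(v.transpose(1, 2)).transpose(1, 2)   # (B, vid_feat, T_vid)
        # Upsample video features to the audio frame rate, broadcast over frequency.
        v = F.interpolate(v, size=a.shape[-1], mode="linear", align_corners=False)
        v = v.unsqueeze(2).expand(-1, -1, a.shape[2], -1)        # (B, vid_feat, F, T_aud)
        mask = self.decoder(torch.cat([a, v], dim=1))            # (B, 1, F, T_aud)
        return mask * mix_spec                                   # estimated vocal spectrogram


# Shape check with dummy inputs (2-clip batch, 256 frequency bins, 128 audio frames,
# 32 video frames of 96x96 mouth crops).
model = TwoStreamSeparator()
est = model(torch.randn(2, 1, 256, 128), torch.randn(2, 3, 32, 96, 96))
print(est.shape)  # torch.Size([2, 1, 256, 128])
```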
“…Figure 4: A taxonomy of audio-visual multimodal processing. Lipreading: WLAS [15], LiRA [56], Vid2Speech [24], LipNet [3], Ephrat et al. [25], Chung et al. [15]. AV speech separation: AV-CVAE [64], VisualSpeech [29], Acappella [61], FaceFilter [16], Ephrat et al. [26], Afouras et al. [1], Gabbay et al. [27]. Talking face generation: DAVS [110], ATVGnet [8], Chung et al. [14], Zhu et al. [112], Song et al. [86], Chen et al. [7]. Sound generation: I2S [6], REGNET [9], Foley Music [28], Zhou et al. [111].…”
Section: AV Multimodal Processing (mentioning)
Confidence: 99%
“…The results show that the approach generalises well in real-world scenarios. Finally, Montesinos et al. [16] proposed an AV convolutional network based on graphs to separate the singing voice by exploiting acoustic and visual information. The experimental results show that the proposed approach outperforms state-of-the-art audio-only singing voice separation approaches.…”
Section: Clean Images (mentioning)
Confidence: 99%
“…In the literature, extensive studies have been carried out to develop AV SE methods in the time domain and the frequency domain [4,5,6,7,8,9,10,11,12,13,14,15,16,17]. However, despite significant research in the area of AV SE, real-time processing models with no or low latency (8-12 ms) remain a formidable technical challenge.…”
Section: Introduction (mentioning)
Confidence: 99%