2022
DOI: 10.1609/aaai.v36i4.20366
Zero-Shot Audio Source Separation through Query-Based Learning from Weakly-Labeled Data

Abstract: Deep learning techniques for separating audio into different sound sources face several challenges. Standard architectures require training separate models for different types of audio sources. Although some universal separators employ a single model to target multiple sources, they have difficulty generalizing to unseen sources. In this paper, we propose a three-component pipeline to train a universal audio source separator from a large, but weakly-labeled dataset: AudioSet. First, we propose a transformer-ba…

Cited by 18 publications (4 citation statements)
References 15 publications
“…These methods separate the target sound conditioned on the average audio embedding of a few audio examples of the target source, which requires labeled single sources for training. The works in [15], [16] propose to train an audio-queried sound separation system on large-scale weakly labeled data (e.g., AudioSet [30]): a sound event detection model [31] first detects anchor segments of sound events, which are then used to constitute artificial mixtures for training the audio-queried separation models. These audio-queried sound separation approaches have shown great potential in separating unseen sound sources.…”
Section: B Query-based Sound Separation
confidence: 99%
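The training recipe quoted above (detect anchor segments of a target event, mix them into artificial mixtures, and condition separation on an averaged query embedding) can be sketched in numpy. All function names, shapes, and the SNR-based mixing rule below are illustrative assumptions, not the actual code of the cited works:

```python
import numpy as np

def make_artificial_mixture(anchor_target, anchor_interference, snr_db=0.0):
    """Build one (mixture, target) training pair from two detected anchor
    segments, scaling the interference to a chosen target-to-interference SNR."""
    power_t = np.mean(anchor_target ** 2)
    power_i = np.mean(anchor_interference ** 2) + 1e-12  # avoid divide-by-zero
    scale = np.sqrt(power_t / (power_i * 10 ** (snr_db / 10)))
    mixture = anchor_target + scale * anchor_interference
    return mixture, anchor_target

def average_query_embedding(example_embeddings):
    """Condition the separator on the average embedding of a few example
    clips of the target source, as described in the quoted passage."""
    return np.mean(np.stack(example_embeddings), axis=0)
```

A separator would then be trained to map (mixture, query embedding) pairs back to the anchor target; that model itself is outside this sketch.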
“…Several works have utilized VAE models for exploring audio representations, including VAEs for finding disentangled audio representations (Luo, Agres, & Herremans, 2019) and VAEs for modeling audio containing speech (Hsu, Zhang, & Glass, 2017). Apart from VAEs, simple convolutional models are used in different music tasks such as music recommendation (Chen, Liang, Ma, & Gu, 2021) and source separation (Chen, Du, et al., 2021). In our experiments on audio representations, we chose a convolutional VAE to map the short-time Fourier transform (STFT) representation of audio into a lower-dimensional latent representation.…”
Section: VAE For Audio
confidence: 99%
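The passage above maps an STFT representation into a low-dimensional VAE latent. A minimal numpy sketch of the two steps (a Hann-windowed magnitude STFT, then a toy linear encoder with the reparameterization trick); the window/hop sizes, latent size, and linear weights are illustrative assumptions, not the cited model:

```python
import numpy as np

def stft_magnitude(signal, n_fft=256, hop=128):
    """Magnitude spectrogram via a Hann-windowed short-time Fourier transform."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))  # (frames, bins)

def encode(spec, w_mu, w_logvar, rng):
    """Toy VAE encoder step: linear maps to mean and log-variance,
    then sample z = mu + sigma * eps (the reparameterization trick)."""
    x = spec.flatten()
    mu = w_mu @ x
    logvar = w_logvar @ x
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps
```

A real convolutional VAE would replace the linear maps with convolutional stacks and train them with a reconstruction plus KL-divergence loss.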
“…Unlike traditional MSS methods, which mainly focus on the acoustic characteristics of sound sources for studying songs and instruments in mixed audio, in recent years the growth of big data and computing technology has led to the introduction of more modern computing techniques into the field of audio processing, bringing new ideas to MSS [5][6]. Chen et al. addressed the poor separation performance on multi-source mixed audio by using three sets of channels to train on the audio set based on deep learning, thereby improving separation performance and expanding its generalization [7]. Huang et al. proposed a compatibility prediction model based on self-supervised and semi-supervised learning to address the low prediction performance of audio element compatibility in current mixed-audio MSS.…”
Section: Introduction
confidence: 99%