One-Shot Conditional Audio Filtering of Arbitrary Sounds

Gfeller, Beat; Roblek, Dominik; Tagliasacchi, Marco

doi:10.1109/icassp39728.2021.9414003

Cited by 20 publications

(33 citation statements)

References 23 publications

“…There are different ways to exploit the embedding vector within the extraction network [2,8,13]. We use here an elementwise multiplication ( operator in Fig.…”

Section: Soundbeam (mentioning)

confidence: 99%

“…The embedding encoder computes the embedding vector e. There are two ways of estimating the embedding vector: using a 1-hot vector [7,9] or an enrollment audio sample [8]. We describe these two approaches in the following subsections.…”

Section: Soundbeam (mentioning)

confidence: 99%

“…The embedding-based approach does not directly optimize the embedding vector for the AE classes, which can result in lower performance for seen AE classes. However, the method can naturally handle new AE classes, when we provide an enrollment sample with similar sound characteristics as the target, and if the system has been trained with a sufficient variety of AE sounds [8].…”

Section: Enrollment-based Soundbeam (mentioning)

confidence: 99%

“…It has been a long-standing goal of researchers to reproduce human listening capabilities. Recently, neural network-based target sound extraction has received increased interest as a promising approach towards this goal, with methods developed to extract speech of a target speaker [1][2][3][4], music instruments [5,6] or AE sounds [7][8][9]. In this paper, we focus on the AE sound extraction problem, which is particularly challenging given the large variety of sounds it covers (e.g.…”

Section: Introduction (mentioning)

confidence: 99%

“…It uses an extraction neural network that estimates the target AE sound given the sound mixture and an embedding vector that represents the characteristics of the target sound. The embedding vector can be obtained using an embedding encoder that receives either (1) an enrollment audio sample that is similar to the target AE sound [8] or, (2) a 1-hot vector that represents the target AE class [7,9]. The extraction neural network and the embedding encoders are jointly trained.…”

Section: Introduction (mentioning)

confidence: 99%
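
The conditioning scheme described in the statements above (an embedding vector from either a 1-hot class label or an enrollment sample, applied by elementwise multiplication) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy linear-projection-plus-averaging encoder, and the tiny dimensions are hypothetical and are not taken from the cited papers, whose encoders and extraction networks are jointly trained neural networks.

```python
# Illustrative sketch only: names, shapes, and the toy encoder are assumptions.

def one_hot_embedding(class_index, table):
    """Select a per-class embedding; same result as (1-hot vector) x (table)."""
    return list(table[class_index])


def enrollment_embedding(frames, weights):
    """Toy enrollment encoder: linear projection per frame, then a time average.

    frames:  list of feature vectors, one per time frame
    weights: matrix of shape [feature_dim][embedding_dim]
    """
    dim = len(weights[0])
    projected = [
        [sum(f[k] * weights[k][d] for k in range(len(f))) for d in range(dim)]
        for f in frames
    ]
    return [sum(p[d] for p in projected) / len(projected) for d in range(dim)]


def condition(hidden_frames, embedding):
    """Multiply every hidden frame elementwise by the embedding vector."""
    return [[h * e for h, e in zip(frame, embedding)] for frame in hidden_frames]


# Hidden representation of the mixture: 2 frames, 3 channels (made-up values).
hidden = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Option 1: embedding chosen by class index (1-hot conditioning).
class_table = [[1.0, 0.0, 1.0], [0.5, 0.5, 0.5]]
by_class = condition(hidden, one_hot_embedding(1, class_table))

# Option 2: embedding computed from an enrollment sample.
enroll_frames = [[1.0, 0.0], [0.0, 1.0]]
enroll_weights = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
by_enrollment = condition(hidden, enrollment_embedding(enroll_frames, enroll_weights))
```

Both options feed the same `condition` step, which is why a system of this kind can switch between class-label and enrollment-based targets without changing the extraction network itself.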


Few-Shot Learning of New Sound Classes for Target Sound Extraction

Delcroix, Vázquez, Ochiai et al. 2021

Interspeech 2021


Target sound extraction consists of extracting the sound of a target acoustic event (AE) class from a mixture of AE sounds. It can be realized using a neural network that extracts the target sound conditioned on a 1-hot vector that represents the desired AE class. With this approach, embedding vectors associated with the AE classes are directly optimized for the extraction of sound classes seen during training. However, it is not easy to extend this framework to new AE classes, i.e., classes unseen during training. Recently, speech, music, and AE sound extraction based on enrollment audio of the desired sound has offered the potential to extract any target sound in a mixture given only a short audio signal of a similar sound. In this work, we propose combining 1-hot- and enrollment-based target sound extraction, allowing optimal performance for seen AE classes and simple extension to new classes. In experiments with synthesized sound mixtures generated with the Freesound Datasets (FSD), we demonstrate the benefit of the combined framework for both seen and new AE classes. We also propose adapting the embedding vectors obtained from a few enrollment audio samples (few-shot) to further improve performance on new classes.


AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation

Tzinis, Wisdom, Remez et al. 2022

Lecture Notes in Computer Science


Acoustic-Scene-Aware Target Sound Separation With Sound Embedding Refinement

Kim, Chang 2024

IEEE Access


Target sound separation (TSS) aims to separate specific sounds of interest, such as speech or a musical instrument, from complex acoustic environments with multiple overlapping sounds. In realistic scenarios, the important sounds that we want to hear can differ depending on transitions in the surrounding acoustic scene. This study addresses the problem of acoustic-scene-aware TSS, which separates predefined sets of target sounds considered significant for the current acoustic environment. These sets are determined beforehand based on the expected acoustic scenes. For example, the sound of a bicycle bell is predefined as the target sound in a park scene and separated from a mixture of various sounds. As a solution, we propose a novel approach called Acoustic-SCene-Aware Target sound separation with sound Embedding Refinement (SCATER). It refines pre-trained sound embeddings into acoustic-scene-aware representations to guide the separation of specific target sounds based on the surrounding scene. SCATER adopts a multiple instance learning-based acoustic scene classification system for rapid response to scene changes. The refined sound embeddings serve as cues for the TSS model, enabling the separation of different target sounds across various acoustic scenes. Experimental results demonstrate the superiority of SCATER over an approach that combines sound separation and scene classification separately.

