SoundBeam: Target Sound Extraction Conditioned on Sound-Class Labels and Enrollment Clues for Increased Performance and Continuous Learning

Delcroix, Marc; Vázquez, Jorge Bennasar; Ochiai, Tsubasa; Kinoshita, Keisuke; Ohishi, Yasunori; Araki, Shoko

doi:10.1109/taslp.2022.3221000

Cited by 14 publications

(5 citation statements)

References 49 publications

“…Compared with speech systems, this is an underexplored problem in the audio machine learning community. However recent works have proposed neural networks that can achieve target sound extraction where clues about the target sound are provided either via audio [15,21], images [19,69], text [33,35], onomatopoeic words [46], or a one-hot vectors [45]. All these models are designed for offline processing of audio clips, where the neural network has access to the entire audio file (≥ 1 s) and hence cannot support our real-time hearable use-case.…”

Section: unknown (mentioning)

confidence: 99%

Semantic Hearing: Programming Acoustic Scenes with Binaural Hearables

Veluri, Itani, Chan et al. 2023

Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology

“…Although acquiring a label query is effortless compared to the audio or visual query, the label set is often pre-defined and adheres to a finite set of source categories. This imposes a challenge when attempting to generalize the separation system into an open-domain scenario, which often requires re-training the sound separation model or using complicated methods such as continual learning [20], [32]. In addition, label information lacks the capability to describe the relationship between multiple sound events such as their spatial relation and temporal order.…”

Section: B Query-based Sound Separation (mentioning)

confidence: 99%

“…We utilize signal-to-distortion ratio improvement (SDRi) [15], [20] and scale-invariant SDR (SI-SDR) [72] to evaluate the performance of sound separation tasks. For the speech enhancement task, following previous works [6], [7], [70], [71], we apply the Perceptual evaluation of speech quality (PESQ) [73], Mean opinion score (MOS) predictor of signal distortion (CSIG), MOS predictor of background-noise intrusiveness (CBAK), MOS predictor of overall signal quality (COVL) [74] and segmental signal-to-ratio noise (SSNR) [75] for evaluation.…”

Section: Evaluation Metrics (mentioning)

confidence: 99%
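
The two separation metrics named in the statement above can be illustrated with a minimal NumPy sketch. It is not the evaluation code of the citing paper: the function names `si_sdr`, `sdr`, and `sdr_improvement` are hypothetical, and `sdr` here is the plain energy-ratio definition, which is an assumption (published results often use a BSS-Eval style SDR instead).

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between a 1-D estimate and its reference."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    )


def sdr(estimate, reference, eps=1e-8):
    """Plain SDR in dB (assumed definition: reference energy over error energy)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    error = reference - estimate
    return 10.0 * np.log10(
        (np.dot(reference, reference) + eps) / (np.dot(error, error) + eps)
    )


def sdr_improvement(estimate, mixture, reference):
    """SDRi: SDR of the extracted signal minus SDR of the unprocessed mixture."""
    return sdr(estimate, reference) - sdr(mixture, reference)
```

As a usage example, with `reference` the clean target, `mixture = reference + noise`, and `estimate = reference + 0.1 * noise`, `sdr_improvement(estimate, mixture, reference)` is about 20 dB, and `si_sdr` is unchanged when `estimate` is multiplied by a constant gain.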

Separate What You Describe: Language-Queried Audio Source Separation

Liu, Liu, Kong et al. 2022

Interspeech 2022

Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instruments, limited classes of audio events), are unable to separate audio concepts in the open domain. In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries. We train AudioSep on large-scale multimodal datasets and extensively evaluate its capabilities on numerous tasks including audio event separation, musical instrument separation, and speech enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models. For reproducibility of this work, we will release the source code, evaluation benchmark and pre-trained model at: https://github.com/Audio-AGI/AudioSep.

“…Specifically, the target sound embedding is extracted from the reference audio using a separate model-based target sound encoder and is subsequently provided to the SS model as a conditional input. Similarly, SoundBeam [23] extended SoundFilter [22] such that multiple reference audio signals could be utilized for target sound enrollment, and further employed the target sound embedding table in addition to the model-based target sound encoder. As a shared embedding space was learned for these two embedding schemes, the enrollment of a new target sound is allowed by fine-tuning the two embedding modules to produce two similar embeddings in the shared embedding space [23].…”

Section: Introduction (mentioning)

confidence: 99%

“…Similarly, SoundBeam [23] extended SoundFilter [22] such that multiple reference audio signals could be utilized for target sound enrollment, and further employed the target sound embedding table in addition to the model-based target sound encoder. As a shared embedding space was learned for these two embedding schemes, the enrollment of a new target sound is allowed by fine-tuning the two embedding modules to produce two similar embeddings in the shared embedding space [23]. In addition to target sound embeddings, other informative clues, such as the timestamp information of an SE [24] or the location of the sound source [25], have been employed for TSS.…”

Section: Introduction (mentioning)

confidence: 99%

Acoustic-Scene-Aware Target Sound Separation With Sound Embedding Refinement

Kim, Chang 2024

IEEE Access

Target sound separation (TSS) aims to separate specific sounds of interest, like a speech or a musical instrument, from complex acoustic environments with multiple overlapping sounds. In realistic scenarios, the important sounds that we want to hear can differ depending on transitions in the surrounding acoustic scene. This study addresses the problem of acoustic-scene-aware TSS, which separates predefined sets of target sounds considered significant for the current acoustic environment. Predefined sets of target sounds were determined beforehand based on the expected acoustic scenes. For example, the sound of a bicycle bell is predefined as the target sound in a park scene and separated from a mixture of various sounds. As a solution, we propose a novel approach called Acoustic-SCene-Aware Target sound separation with sound Embedding Refinement (SCATER). It refines pre-trained sound embeddings into acoustic-scene-aware representations to guide the separation of specific target sounds based on the surrounding scene. SCATER adopts a multiple instance learning-based acoustic scene classification system for rapid response to scene changes. The refined sound embeddings serve as cues for the TSS model, enabling the separation of different target sounds across various acoustic scenes. Experimental results demonstrate the superiority of SCATER over an approach that combines sound separation and scene classification separately.
