2023
DOI: 10.1109/taslp.2022.3221000

SoundBeam: Target Sound Extraction Conditioned on Sound-Class Labels and Enrollment Clues for Increased Performance and Continuous Learning

Cited by 14 publications (5 citation statements)
References 49 publications
“…Compared with speech systems, this is an underexplored problem in the audio machine learning community. However, recent works have proposed neural networks that can achieve target sound extraction where clues about the target sound are provided either via audio [15,21], images [19,69], text [33,35], onomatopoeic words [46], or one-hot vectors [45]. All these models are designed for offline processing of audio clips, where the neural network has access to the entire audio file (≥ 1 s) and hence cannot support our real-time hearable use-case.…”
Section: (mentioning)
confidence: 99%
“…Although acquiring a label query is effortless compared to the audio or visual query, the label set is often pre-defined and adheres to a finite set of source categories. This imposes a challenge when attempting to generalize the separation system to an open-domain scenario, which often requires re-training the sound separation model or using complicated methods such as continual learning [20], [32]. In addition, label information lacks the capability to describe the relationship between multiple sound events, such as their spatial relation and temporal order.…”
Section: B. Query-based Sound Separation (mentioning)
confidence: 99%
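To make the limitation described above concrete, here is a minimal sketch (an assumed setup, not code from the cited works) of a label-query embedding table: because the table has exactly one row per pre-defined class, an unseen class cannot be queried without extending the table and re-training, or resorting to continual-learning strategies.

```python
import torch
import torch.nn as nn

# Hypothetical pre-defined label set fixed at training time.
KNOWN_CLASSES = ["dog_bark", "siren", "keyboard_typing"]
label_table = nn.Embedding(num_embeddings=len(KNOWN_CLASSES), embedding_dim=256)

def query_embedding(class_name: str) -> torch.Tensor:
    if class_name not in KNOWN_CLASSES:
        # An unseen class has no row in the table; supporting it means extending the table and
        # re-training the separator, or applying continual-learning strategies as in [20], [32].
        raise KeyError(f"'{class_name}' is not in the pre-defined label set")
    return label_table(torch.tensor(KNOWN_CLASSES.index(class_name)))

emb = query_embedding("siren")          # works: a known class
# query_embedding("glass_breaking")     # would raise KeyError: class unseen at training time
```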
“…We utilize signal-to-distortion ratio improvement (SDRi) [15], [20] and scale-invariant SDR (SI-SDR) [72] to evaluate the performance of sound separation tasks. For the speech enhancement task, following previous works [6], [7], [70], [71], we apply the Perceptual Evaluation of Speech Quality (PESQ) [73], the mean opinion score (MOS) predictor of signal distortion (CSIG), the MOS predictor of background-noise intrusiveness (CBAK), the MOS predictor of overall signal quality (COVL) [74], and the segmental signal-to-noise ratio (SSNR) [75] for evaluation.…”
Section: Evaluation Metrics (mentioning)
confidence: 99%
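For reference, the two separation metrics named above can be computed from their standard definitions. The NumPy sketch below implements SI-SDR and SDRi as commonly defined; the synthetic signals in the usage example are purely for illustration.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB: project the estimate onto the target, compare energies."""
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target                  # scaled target component of the estimate
    noise = estimate - projection                # residual distortion
    return 10 * np.log10((projection @ projection + eps) / (noise @ noise + eps))

def sdr(estimate, target, eps=1e-8):
    """Plain (scale-dependent) SDR in dB."""
    noise = estimate - target
    return 10 * np.log10((target @ target + eps) / (noise @ noise + eps))

def sdr_improvement(estimate, mixture, target):
    """SDRi: SDR of the processed estimate minus SDR of the unprocessed mixture."""
    return sdr(estimate, target) - sdr(mixture, target)

# toy usage with synthetic 1-second signals at 16 kHz
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
mixture = target + 0.5 * rng.standard_normal(16000)
estimate = target + 0.1 * rng.standard_normal(16000)
print(f"SI-SDR: {si_sdr(estimate, target):.1f} dB, SDRi: {sdr_improvement(estimate, mixture, target):.1f} dB")
```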
“…Specifically, the target sound embedding is extracted from the reference audio using a separate model-based target sound encoder and is subsequently provided to the SS model as a conditional input. Similarly, SoundBeam [23] extended SoundFilter [22] such that multiple reference audio signals could be utilized for target sound enrollment, and further employed a target sound embedding table in addition to the model-based target sound encoder. As a shared embedding space is learned for these two embedding schemes, a new target sound can be enrolled by fine-tuning the two embedding modules to produce similar embeddings in the shared space [23]. In addition to target sound embeddings, other informative clues, such as the timestamp information of an SE [24] or the location of the sound source [25], have been employed for TSS.…”
Section: Introduction (mentioning)
confidence: 99%
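The shared-embedding-space enrollment idea described in the statement above can be sketched as follows. This is an assumed, simplified illustration (shapes, losses, and initialization are hypothetical choices, not the actual SoundBeam implementation): an enrollment encoder and a class-label embedding table are trained to produce matching embeddings, and a new class is enrolled by appending a table row and fine-tuning the two modules toward agreement.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnrollmentEncoder(nn.Module):
    """Maps reference (enrollment) audio to a target-sound embedding. Toy architecture."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(1, 64, 16, stride=8), nn.ReLU(),
                                 nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, emb_dim))
    def forward(self, ref_audio):                  # (batch, 1, samples)
        return self.net(ref_audio)                 # (batch, emb_dim)

n_known_classes, emb_dim = 10, 256
label_table = nn.Embedding(n_known_classes, emb_dim)   # one learnable embedding per known class
enroll_enc = EnrollmentEncoder(emb_dim)

# Training-time objective tying the two embedding schemes to one shared space
# (an assumed L2 matching loss, used here only to convey the idea):
ref_audio = torch.randn(4, 1, 16000)                    # enrollment clips of known classes
class_ids = torch.randint(0, n_known_classes, (4,))
shared_space_loss = F.mse_loss(enroll_enc(ref_audio), label_table(class_ids))

# Enrolling a new sound class: append a new row to the table, then fine-tune the embedding
# modules so the new row and the enrollment-audio embeddings coincide in the shared space.
new_row = enroll_enc(torch.randn(1, 1, 16000)).detach()     # initialized from enrollment audio
label_table = nn.Embedding.from_pretrained(
    torch.cat([label_table.weight.data, new_row]), freeze=False)
```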