Interspeech 2022
DOI: 10.21437/interspeech.2022-10894

Separate What You Describe: Language-Queried Audio Source Separation

Abstract: Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instruments, limited classes of audio events), are unable to separate audio concepts in the open domain. In this work, w…
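The abstract describes conditioning a separation model on a free-form text query. Below is a minimal sketch of that conditioning pattern, assuming a pretrained BERT query encoder and a toy mask-predicting MLP over magnitude-spectrogram frames; the module names and sizes are illustrative assumptions, not the authors' LASS-Net.

```python
# Minimal sketch of language-queried separation (illustrative; not the
# authors' LASS-Net; module names and sizes are assumptions).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class LanguageQueriedSeparator(nn.Module):
    def __init__(self, n_freq_bins=513, hidden=256):
        super().__init__()
        # Query encoder: frozen pretrained BERT, pooled to a single vector.
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.query_proj = nn.Linear(self.text_encoder.config.hidden_size, hidden)
        # Toy separator: an MLP over spectrogram frames, conditioned on the
        # query embedding, predicting a soft mask for the target source.
        self.separator = nn.Sequential(
            nn.Linear(n_freq_bins + hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_freq_bins),
            nn.Sigmoid(),
        )

    def forward(self, mixture_mag, queries):
        # mixture_mag: (batch, frames, n_freq_bins) magnitude spectrogram
        # queries: list of strings, e.g. ["a dog barking", "rain falling"]
        tokens = self.tokenizer(queries, return_tensors="pt", padding=True)
        with torch.no_grad():
            query_emb = self.text_encoder(**tokens).last_hidden_state[:, 0]  # [CLS]
        q = self.query_proj(query_emb)                        # (batch, hidden)
        q = q.unsqueeze(1).expand(-1, mixture_mag.size(1), -1)
        mask = self.separator(torch.cat([mixture_mag, q], dim=-1))
        return mask * mixture_mag                             # estimated target
```

The published model pairs the query encoder with a much stronger encoder-decoder separator; the sketch only shows how a single query embedding can be broadcast along time and fused with the mixture features to predict a mask.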

Cited by 29 publications (12 citation statements) | References 48 publications
“…Recently, the problem was extended to the extraction of arbitrary sounds from a mixture [66], [67], e.g., extracting the sound of a siren or a klaxon from a recording of a mixture of street sounds. We can use such systems as that introduced in Section IV to tackle these problems, where the clue can be a class label indicating the type of target sound [66], the enrollment audio of a similar target sound [67], a video of the sound source [9] or a text description of the target sound [68]. Target sound extraction may become an important technology to design, e.g., hearables or hearing aids that could filter out nuisances and emphasize important sounds in our surroundings, or audio visual scene analysis [9].…”
Section: Beyond Speech
confidence: 99%
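The quoted passage treats the conditioning clue as interchangeable: a class label, enrollment audio, a video, or a text description can each specify the target. A hedged sketch of that shared interface, with hypothetical clue encoders that all map into one embedding space consumed by a single extractor (the class count and feature sizes are illustrative):

```python
# Sketch of a clue-conditioned target sound extractor. The encoders are
# hypothetical; the cited systems differ in detail. Each clue type maps
# into one shared embedding space that conditions the same extractor.
import torch
import torch.nn as nn

EMB_DIM = 256

class LabelClueEncoder(nn.Module):
    """Class-label clue, e.g. 'siren' given as an integer class id."""
    def __init__(self, n_classes=527, dim=EMB_DIM):  # 527: AudioSet classes
        super().__init__()
        self.table = nn.Embedding(n_classes, dim)

    def forward(self, label_ids):                    # (batch,)
        return self.table(label_ids)                 # (batch, dim)

class EnrollmentClueEncoder(nn.Module):
    """Enrollment-audio clue: mean-pooled projected spectrogram frames."""
    def __init__(self, n_freq_bins=513, dim=EMB_DIM):
        super().__init__()
        self.proj = nn.Linear(n_freq_bins, dim)

    def forward(self, enroll_mag):                   # (batch, frames, bins)
        return self.proj(enroll_mag).mean(dim=1)     # (batch, dim)

class TargetSoundExtractor(nn.Module):
    """One extractor, conditioned on whichever clue embedding is supplied."""
    def __init__(self, n_freq_bins=513, dim=EMB_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_freq_bins + dim, dim), nn.ReLU(),
            nn.Linear(dim, n_freq_bins), nn.Sigmoid())

    def forward(self, mixture_mag, clue_emb):
        clue = clue_emb.unsqueeze(1).expand(-1, mixture_mag.size(1), -1)
        mask = self.net(torch.cat([mixture_mag, clue], dim=-1))
        return mask * mixture_mag
```

Because every clue encoder emits the same EMB_DIM-dimensional vector, the extractor itself never needs to know which clue type specified the target.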
“…Compared with flourishing research on VL multimodal learning, research on audio-language multimodal learning is limited due to the lack of AL datasets. Almost all AL tasks, such as automated audio captioning [10], [47], [48], language-based audio retrieval [8], [9], text-to-audio generation [15], [16] and language-queried sound separation [19], rely on two main audio captioning datasets, AudioCaps [38] and Clotho [43]. AudioCaps contains about 50k audio clips sourced from AudioSet [1], the largest audio event dataset, and is annotated by humans.…”
Section: B. Audio-language Datasets
confidence: 99%
“…More recently, there has been a surge of interest in establishing a more profound comprehension of audio content by connecting audio and language. A number of audio-language (AL) multimodal learning tasks have been introduced, such as text-to-audio retrieval [8], [9], automated audio captioning [10]- [12], audio question answering [13], [14], text-based sound generation [15]- [18], and text-based sound separation [19].…”
Section: Introduction
confidence: 99%
“…One more text-to-audio solution is AudioLDM, proposed by [9]. The authors emphasize higher generation quality with less expensive computing.…”
Section: Literature Review
confidence: 99%