ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413896

LaSAFT: Latent Source Attentive Frequency Transformation for Conditioned Source Separation

Abstract: Recent deep-learning approaches have shown that Frequency Transformation (FT) blocks can significantly improve spectrogram-based single-source separation models by capturing frequency patterns. The goal of this paper is to extend the FT block to fit the multi-source task. We propose the Latent Source Attentive Frequency Transformation (LaSAFT) block to capture source-dependent frequency patterns. We also propose the Gated Point-wise Convolutional Modulation (GPoCM), an extension of Feature-wise Linear Modulation (FiLM) […]
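For readers unfamiliar with FiLM, the mechanism GPoCM builds on: a FiLM layer predicts one scale and one shift per feature channel from a condition embedding and applies them to the features. The sketch below is a minimal PyTorch illustration under assumed tensor shapes; names such as `to_gamma_beta` are ours, not from the paper.

```python
# Minimal FiLM (Feature-wise Linear Modulation) sketch; illustrative only.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Scales and shifts each feature channel with condition-dependent
    gamma/beta, following the FiLM idea the abstract refers to."""
    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        # One linear layer predicts both gamma and beta from the condition.
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time) spectrogram features
        # cond: (batch, cond_dim) embedding of the target-source label
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]  # broadcast over freq and time
        beta = beta[:, :, None, None]
        return gamma * x + beta
```

Note that FiLM acts on each channel independently; this is exactly the limitation the paper's GPoCM addresses by letting channels interact.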

Cited by 27 publications (34 citation statements)
References 12 publications (23 reference statements)
“…Choi et al [161] incorporated the idea of source-based conditioning [192] in the time-distributed U-Net framework [193] such that a source-attentive frequency transformation block can capture the source-dependent frequency patterns. They also proposed a Gated Point-wise Convolutional Modulation layer that extends the concept of FiLM layer [192] by incorporating inter-channel interactions.…”
Section: B. Content-informed Data-driven Methods
Mentioning, confidence: 99%
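To make the "inter-channel interactions" point concrete, here is a hedged sketch of a GPoCM-style layer, reconstructed from the description above: a point-wise (1x1) convolution whose kernel and bias are predicted from the condition embedding, with a sigmoid gate on its output. The shapes, the grouped-convolution trick, and all names are our assumptions, not the authors' implementation.

```python
# Hedged GPoCM-style sketch: condition-generated 1x1 conv plus sigmoid gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPoCM(nn.Module):
    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        c = num_channels
        # Predict a full (c x c) point-wise kernel plus bias per example,
        # so channels can interact (unlike FiLM's per-channel scale/shift).
        self.to_weight = nn.Linear(cond_dim, c * c)
        self.to_bias = nn.Linear(cond_dim, c)
        self.num_channels = c

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time), cond: (batch, cond_dim)
        b, c = x.size(0), self.num_channels
        w = self.to_weight(cond).view(b * c, c, 1, 1)  # per-example kernels
        bias = self.to_bias(cond).reshape(b * c)
        # Grouped-conv trick: apply every example's own kernel in one call.
        y = F.conv2d(x.reshape(1, b * c, *x.shape[2:]), w, bias, groups=b)
        y = y.reshape(b, c, *x.shape[2:])
        return torch.sigmoid(y) * x  # gated modulation of the input features
```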
“…Recently, there has been increased interest in target sound extraction (TSE) applications to speech [17], [38]-[41], music [15], [16], [18], [19], [42]-[46], and universal sounds [2], [11]-[14], [47], [48]. Various types of auxiliary clues have been proposed to identify the target in a sound mixture, including enrollment audio samples [12], [18], [19], [38], [39], [47], class labels [2], [11], [45], video signals of the target source [15], [42], [48], [49], and recently even onomatopoeia [14].…”
Section: B. Target Sound Extraction
Mentioning, confidence: 99%
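As a purely illustrative aside (not drawn from any cited paper), two of these clue types can be mapped to a shared conditioning embedding so that a single TSE separator can consume either; every class and parameter here is hypothetical.

```python
# Hypothetical clue encoders producing a shared conditioning embedding.
import torch
import torch.nn as nn

class LabelClueEncoder(nn.Module):
    """Maps a target-class id to a learned embedding."""
    def __init__(self, num_classes: int, embed_dim: int):
        super().__init__()
        self.table = nn.Embedding(num_classes, embed_dim)

    def forward(self, class_id: torch.Tensor) -> torch.Tensor:
        return self.table(class_id)  # (batch, embed_dim)

class EnrollmentClueEncoder(nn.Module):
    """Mean-pools a tiny conv stack over an enrollment waveform."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, embed_dim, kernel_size=400, stride=160),
            nn.ReLU(),
            nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, 1, samples) -> (batch, embed_dim)
        return self.net(wav).mean(dim=-1)
```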
“…We base our separation model on the well-studied U-Net architecture [4,11,21] in order to complement related work on conditioned source separation [10,11,15]. While models with better separation performance exist [13], the purpose of this study is to explore the effects of different types of conditioning approaches and design choices, which can generally translate to better performance, rather than to strive for state-of-the-art results.…”
Section: Model Configurations
Mentioning, confidence: 99%
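A minimal sketch of such a conditioned U-Net, assuming a magnitude-spectrogram input and FiLM-style conditioning at the bottleneck; the depth, channel counts, and masking head are our illustrative choices, not any cited paper's configuration.

```python
# Assumption-level conditioned U-Net sketch (two levels, mask output).
import torch
import torch.nn as nn

class CondUNet(nn.Module):
    """Tiny U-Net where the condition scales/shifts bottleneck channels,
    in the FiLM style sketched earlier."""
    def __init__(self, cond_dim: int = 32, ch: int = 16):
        super().__init__()
        self.enc1 = nn.Conv2d(1, ch, 3, stride=2, padding=1)
        self.enc2 = nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1)
        self.to_gamma_beta = nn.Linear(cond_dim, 4 * ch)  # condition injection
        self.dec2 = nn.ConvTranspose2d(2 * ch, ch, 4, stride=2, padding=1)
        self.dec1 = nn.ConvTranspose2d(2 * ch, 1, 4, stride=2, padding=1)

    def forward(self, spec: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, freq, time); cond: (batch, cond_dim)
        e1 = torch.relu(self.enc1(spec))
        e2 = torch.relu(self.enc2(e1))
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        h = gamma[:, :, None, None] * e2 + beta[:, :, None, None]
        d2 = torch.relu(self.dec2(h))
        d1 = self.dec1(torch.cat([d2, e1], dim=1))  # skip connection
        return torch.sigmoid(d1) * spec  # source-dependent mask on the mixture
```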
“…In order to separate more than one instrument, more than one model is required. More recent models have aimed to overcome this limitation and support the separation of various instruments using a single model via instrument class conditioning mechanisms [10]-[14]. However, these models are still limited to the instrument classes that the models were trained on and do not generalize to unseen instruments.…”
Section: Introduction
Mentioning, confidence: 99%
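A short usage sketch of the hypothetical CondUNet above makes the single-model point concrete: the same weights separate different instruments because only the condition embedding changes. Class ids and dimensions are illustrative.

```python
# One model, multiple instruments: swap the condition, keep the weights.
import torch

model = CondUNet(cond_dim=32)
instrument_embed = torch.nn.Embedding(4, 32)  # e.g. vocals/drums/bass/other

mix = torch.rand(1, 1, 64, 128)               # dummy magnitude spectrogram
vocals = model(mix, instrument_embed(torch.tensor([0])))
drums = model(mix, instrument_embed(torch.tensor([1])))
print(vocals.shape, drums.shape)              # both (1, 1, 64, 128)
# An unseen instrument would need a new embedding (and retraining),
# which is exactly the generalization limit the excerpt describes.
```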