LaSAFT: Latent Source Attentive Frequency Transformation for Conditioned Source Separation

Choi, Woosung; Kim, Minseok; Chung, Jaehwa; Jung, Soonyoung

doi:10.1109/icassp39728.2021.9413896

Cited by 27 publications

(34 citation statements)

References 12 publications

(23 reference statements)

“…Choi et al [161] incorporated the idea of source-based conditioning [192] in the time-distributed U-Net framework [193] such that a source-attentive frequency transformation block can capture the source-dependent frequency patterns. They also proposed a Gated Point-wise Convolutional Modulation layer that extends the concept of FiLM layer [192] by incorporating inter-channel interactions.…”

Section: B Content-informed Data-driven Methods (mentioning)

confidence: 99%

Deep Learning Approaches in Topics of Singing Information Processing

Gupta, Goto 2022

IEEE/ACM Trans. Audio Speech Lang. Process.

Singing, the vocal production of musical tones, is one of the most important elements of music. Addressing the needs of real-world applications, the study of technologies related to singing voices has become an increasingly active area of research. In this paper, we provide a comprehensive overview of the recent developments in the field of singing information processing, specifically in the topics of singing skill evaluation, singing voice synthesis, singing voice separation, and lyrics synchronization and transcription. We will especially focus on deep learning approaches including modern representation learning techniques for singing voices. We will also provide an overview of contributions in public datasets for singing voice research.

“…Recently, there has been increased interest in TSE applications to speech [17], [38]- [41], music [15], [16], [18], [19], [42]- [46], and universal sounds [2], [11]- [14], [47], [48]. Various types of auxiliary clues have been proposed to identify the target in a sound mixture, including enrollment audio samples [12], [18], [19], [38], [39], [47], class labels [2], [11], [45], video signals of the target source [15], [42], [48], [49], and recently even onomatopoeia [14].…”

Section: B Target Sound Extraction (mentioning)

confidence: 99%

SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning

Delcroix, Vázquez, Ochiai et al. 2022

Preprint

In many situations, we would like to hear desired sound events (SEs) while being able to ignore interference. Target sound extraction (TSE) aims at tackling this problem by estimating the sound of target SE classes in a mixture while suppressing all other sounds. We can achieve this with a neural network that extracts the target SEs by conditioning it on clues representing the target SE classes. Two types of clues have been proposed, i.e., target SE class labels and enrollment sound samples similar to the target sound. Systems based on SE class labels can directly optimize embedding vectors representing the SE classes, resulting in high extraction performance. However, extending these systems to the extraction of new SE classes not encountered during training is not easy. Enrollment-based approaches extract SEs by finding sounds in the mixtures that share similar characteristics to the enrollment. These approaches do not explicitly rely on SE class definitions and can thus handle new SE classes. In this paper, we introduce a TSE framework, SoundBeam, that combines the advantages of both approaches. We also perform an extensive evaluation of the different TSE schemes using synthesized and real mixtures, which shows the potential of SoundBeam.

“…We base our separation model on the well-studied U-Net architecture [4,11,21] in order to complement related work on conditioned source separation [10,11,15]. While models with better separation performance exist [13], the purpose of this study is to explore the effects of different types of conditioning approaches and design choices, which can generally translate to better performance, rather than to strive for state-of-the-art results.…”

Section: Model Configurations (mentioning)

confidence: 99%

“…In order to separate more than one instrument, more than one model is required. More recent models have aimed to overcome this limitation and support the separation of various instruments using a single model via instrument class conditioning mechanisms [10][11][12][13][14]. However, these models are still limited to the instrument classes that the models were trained on and do not generalize to unseen instruments.…”

Section: Introduction (mentioning)

confidence: 99%

Few-Shot Musical Source Separation

Wang, Stoller, Bittner et al. 2022

Preprint

Deep learning-based approaches to musical source separation are often limited to the instrument classes that the models are trained on and do not generalize to separate unseen instruments. To address this, we propose a few-shot musical source separation paradigm. We condition a generic U-Net source separation model using few audio examples of the target instrument. We train a few-shot conditioning encoder jointly with the U-Net to encode the audio examples into a conditioning vector to configure the U-Net via feature-wise linear modulation (FiLM). We evaluate the trained models on real musical recordings in the MUSDB18 and MedleyDB datasets. We show that our proposed few-shot conditioning paradigm outperforms the baseline one-hot instrument-class conditioned model for both seen and unseen instruments. To extend the scope of our approach to a wider variety of real-world scenarios, we also experiment with different conditioning example characteristics, including examples from different recordings, with multiple sources, or negative conditioning examples.
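
The abstract above describes configuring a U-Net through feature-wise linear modulation (FiLM). As a rough illustration only, the sketch below applies one scale (gamma) and one shift (beta) per feature channel, which is the general idea of FiLM. It is not the code of any of the papers listed here: the function name `film`, the list-of-lists feature layout, and the hand-picked gamma and beta values are all assumptions made for the example. In a conditioned separation model, gamma and beta would be predicted from the conditioning vector rather than fixed.

```python
from typing import List


def film(features: List[List[float]],
         gamma: List[float],
         beta: List[float]) -> List[List[float]]:
    """Apply feature-wise linear modulation.

    Each channel c is transformed as gamma[c] * x + beta[c].
    `features` is laid out as [channel][time frame] (an assumption
    made for this sketch).
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need one gamma and one beta per channel")
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Hypothetical feature map: 2 channels, 3 time frames.
features = [[1.0, 2.0, 3.0],
            [4.0, 5.0, 6.0]]

# Hypothetical modulation parameters. A conditioning encoder would
# normally produce these; here they are fixed by hand.
gamma = [2.0, 0.5]
beta = [1.0, -1.0]

modulated = film(features, gamma, beta)
print(modulated)
```

Because the same input features yield different outputs for different (gamma, beta) pairs, a single network can be steered toward different target sources by changing only the conditioning input.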
