Visually Guided Sound Source Separation Using Cascaded Opponent Filter Network

Zhu, Lingyu; Rahtu, Esa

doi:10.1007/978-3-030-69544-6_25

Cited by 17 publications

(45 citation statements)

References 51 publications


“…Then, PixelPlayer only considered semantic features extracted from the video frames. Appearance information is important as highlighted in [290], where the separation was guided with a single image, but higher performance is expected to be achieved when also motion information is exploited. Zhao et al [287] proposed to combine trajectory and semantic features to condition a source separation network.…”

Section: B Audio-visual Sound Source Separation For Non-speech Signals

mentioning

confidence: 99%

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Michelsanti, Tan, Zhang et al.

2021

IEEE/ACM Trans. Audio Speech Lang. Process.

188


Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. More recently, visual information from the target speakers, such as lip movements and facial expressions, has been introduced to speech enhancement and speech separation systems, because the visual aspect of speech is essentially unaffected by the acoustic environment. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving state-of-the-art performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: visual features; acoustic features; deep learning methods; fusion techniques; training targets and objective functions. We also survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation.


“…Recent works [64,17,63,25,65,66,21,67] have started to exploit visual information (e.g. talking face, playing instruments) to solve the sound separation task.…”

Section: Introduction

mentioning

confidence: 99%

“…While visual motions may be important under certain circumstances (e.g. separating similar type of sources), the single visual frame based approaches have demonstrated surprisingly well performance in [64,65,66]. In this paper, we focus on improving the single visual frame based sound separation.…”

Section: Introduction

mentioning

confidence: 99%


V-SlowFast Network for Efficient Visual Sound Separation

Zhu, Rahtu

2021

Preprint

Self Cite


The objective of this paper is to perform visual sound separation: i) we study visual sound separation on spectrograms of different temporal resolutions; ii) we propose a new light yet efficient three-stream framework V-SlowFast that operates on Visual frame, Slow spectrogram, and Fast spectrogram. The Slow spectrogram captures the coarse temporal resolution while the Fast spectrogram contains the fine-grained temporal resolution; iii) we introduce two contrastive objectives to encourage the network to learn discriminative visual features for separating sounds; iv) we propose an audio-visual global attention module for audio and visual feature fusion; v) the introduced V-SlowFast model outperforms previous state-of-the-art in single-frame based visual sound separation on small- and large-scale datasets: MUSIC-21, AVE, and VGG-Sound. We also propose a small V-SlowFast architecture variant, which achieves 74.2% reduction in the number of model parameters and 81.4% reduction in GMACs compared to the previous multi-stage models. Project page: https://lyzhu.github.io/V-SlowFast.


“…Previous works have proposed models to controllably generate e.g. images [13,17,38,45,48,51,55,57,73,76,77], videos [6,12,25,37,42,46,64,65,65,71], and audios [1,9,15,22,24,47,62,63], or separate sounds [18,19,79,80,84]. However, most of the audio works are music-related, and only a few attempts have been made to generate visually guided audio in an open domain setup [11,83].…”

mentioning

confidence: 99%

Taming Visually Guided Sound Generation

Iashin, Rahtu

2021

Preprint

Self Cite
