2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
DOI: 10.1109/waspaa.2019.8937237
Identify, Locate and Separate: Audio-Visual Object Extraction in Large Video Collections Using Weak Supervision

Abstract: We tackle the problem of audio-visual scene analysis for weakly labeled data. To this end, we build upon our previous audio-visual representation learning framework to perform object classification in noisy acoustic environments and integrate audio source enhancement capability. This is made possible by a novel use of non-negative matrix factorization for the audio modality. Our approach is founded on the multiple instance learning paradigm. Its effectiveness is established through experiments over a challengin…
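The abstract's key audio ingredient is non-negative matrix factorization (NMF). As a rough illustration of what an NMF decomposition of a spectrogram looks like, here is a minimal sketch assuming a magnitude spectrogram and scikit-learn; the component count, loss, and soft-masking scheme are illustrative assumptions, not the paper's configuration:

```python
# Minimal sketch: NMF decomposition of a magnitude spectrogram, as commonly
# used to obtain spectral patterns for audio enhancement/separation.
# Hypothetical parameters; not the authors' exact pipeline.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = np.abs(rng.standard_normal((513, 400)))  # stand-in spectrogram (freq x time)

# Factorize V ~= W @ H: W holds spectral templates, H their activations in time.
model = NMF(n_components=16, init="nndsvda", beta_loss="kullback-leibler",
            solver="mu", max_iter=300, random_state=0)
W = model.fit_transform(V)   # (513, 16) spectral basis
H = model.components_        # (16, 400) temporal activations

# A source estimate can be built from a subset of components via soft masking.
keep = list(range(8))                        # hypothetical component selection
V_hat = W @ H
mask = (W[:, keep] @ H[keep, :]) / np.maximum(V_hat, 1e-8)
V_source = mask * V                          # enhanced-source magnitude estimate
```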

Cited by 15 publications (11 citation statements) · References 25 publications
“…A variety of neural network methods including fully-connected neural networks [13], convolutional neural networks (CNNs) [14,30,31] and convolutional recurrent neural networks (CRNNs) [32,33] have been explored for audio tagging. For sound localization, an identify, locate and separate model [34] was proposed for audio-visual object extraction in large video collections using weak supervision.…”
Section: Audio Tagging With Weakly Labelled Data
confidence: 99%
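For readers unfamiliar with the model families this statement surveys, the sketch below is a minimal CRNN-style audio tagger in PyTorch that aggregates frame-level predictions into a clip-level prediction, so only weak (clip-level) labels are needed for training. The architecture and dimensions are illustrative assumptions, not the model of any cited work:

```python
# Minimal CRNN audio tagger: frame-wise class probabilities are aggregated
# (max pooling) into a clip-level prediction trainable from weak labels.
# Illustrative architecture; not the exact model of any cited work.
import torch
import torch.nn as nn

class CRNNTagger(nn.Module):
    def __init__(self, n_mels=64, n_classes=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((4, 1)),            # pool frequency, keep time steps
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((4, 1)),
        )
        self.gru = nn.GRU(64 * (n_mels // 16), 64, batch_first=True,
                          bidirectional=True)
        self.frame_fc = nn.Linear(128, n_classes)

    def forward(self, x):                     # x: (batch, 1, n_mels, time)
        h = self.conv(x)                      # (batch, 64, n_mels//16, time)
        h = h.permute(0, 3, 1, 2).flatten(2)  # (batch, time, features)
        h, _ = self.gru(h)
        frame_probs = torch.sigmoid(self.frame_fc(h))  # per-frame class probs
        clip_probs = frame_probs.max(dim=1).values     # weak-label aggregation
        return clip_probs, frame_probs

model = CRNNTagger()
clip, frames = model(torch.randn(2, 1, 64, 250))  # clip: (2, 10)
```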
“…At the same time, the field of visually assisted source separation has emerged [10], [44], [45], [46], [47], in particular, with explicit focus on musical data [2], [7], [8], [9], [10]. Starting with capturing only visual appearance features [7], [8], [9], [10], there is a shift towards capturing and integrating motion data [2].…”
Section: Audio-Visual Deep Learning Methods
confidence: 99%
“…To combine the data obtained from different modalities, commonly used approaches include late fusion [47], conditioning at the bottleneck via tile-and-multiply [9], concatenation [45], attention mechanisms [2], [7], and Feature-wise Linear Modulation (FiLM) conditioning [2], [48] (more details in Section 2.3). In the present work, we also analyse different ways to combine audio and visual information and extend prior work to multiple and a variable number of sources in the mixture.…”
Section: Audio-Visual Deep Learning Methods
confidence: 99%
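Of the fusion mechanisms listed in the statement above, FiLM conditioning is the easiest to show compactly: a conditioning embedding (here, a visual one) predicts a per-channel scale and shift applied to the other modality's feature maps. A minimal sketch with hypothetical dimensions, not the cited papers' implementation:

```python
# Minimal FiLM (Feature-wise Linear Modulation) layer: a visual embedding
# predicts per-channel scale (gamma) and shift (beta) for audio feature maps.
# Sketch with hypothetical shapes; not the exact layer from the cited works.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim=512, n_channels=64):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * n_channels)

    def forward(self, audio_feat, visual_emb):
        # audio_feat: (batch, channels, freq, time); visual_emb: (batch, cond_dim)
        gamma, beta = self.to_gamma_beta(visual_emb).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]      # broadcast over freq and time
        beta = beta[:, :, None, None]
        return gamma * audio_feat + beta

film = FiLM()
out = film(torch.randn(2, 64, 32, 100), torch.randn(2, 512))  # (2, 64, 32, 100)
```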
“…The model is too complicated and lacks explanation.
Morrone et al [21]: Use landmarks to generate time-frequency masks. Weakness: additional landmark detection required.
Gao et al [25]: Disentangle audio frequencies related to visual objects. Weakness: separated audio only.
Senocak et al [26]: Focus on the primary area by using attention. Weakness: localized sound source only.
Tian et al [27]: Joint modeling of auditory and visual modalities. Weakness: localized sound source only.
Separate and Localize Objects' Sounds:
Pu et al [19]: Use low rank to extract the sparsely correlated components. Weakness: not for the in-the-wild environment.
Zhao et al [28]: Mix and separate a given audio, without traditional supervision. Weakness: motion information is not considered.
Zhao et al [29]: Introduce motion trajectory and curriculum learning. Weakness: only suitable for synchronized video and audio input.
Rouditchenko et al [30]: Separation and localization use only one modality input. Weakness: does not fully utilize temporal information.
Parekh et al [31]: Weakly supervised learning via multiple-instance learning.…”
Section: Methods, Ideas and Strengths, Weaknesses
confidence: 99%
“…In other words, given a video frame or a sound, the approach used the category-to-feature-channel correspondence to select a specific type of source or object for separation or localization. Aiming to introduce weak labels to improve performance, Parekh et al [31] designed an approach based on multiple-instance learning, a well-known strategy for weakly supervised learning.…”
Section: Methods, Ideas and Strengths, Weaknesses
confidence: 99%
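To make the multiple-instance learning strategy concrete: each clip is treated as a bag of temporal segments (instances), only the bag carries a label, and per-instance scores are pooled into a bag-level score for training, so no segment-level annotation is needed. The sketch below uses a hypothetical feature scorer and max pooling; it illustrates the MIL paradigm in general, not Parekh et al.'s actual model:

```python
# Minimal multiple-instance learning loop: per-segment (instance) scores are
# max-pooled into a per-clip (bag) score and trained against weak clip labels.
# Hypothetical scorer and shapes; not Parekh et al.'s actual architecture.
import torch
import torch.nn as nn

instance_scorer = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.Adam(instance_scorer.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

segments = torch.randn(8, 20, 128)   # (clips, segments per clip, feature dim)
clip_labels = torch.randint(0, 2, (8, 10)).float()  # weak multi-label targets

for _ in range(5):                   # a few illustrative training steps
    logits = instance_scorer(segments)        # (8, 20, 10) per-segment scores
    bag_logits = logits.max(dim=1).values     # MIL max pooling over segments
    loss = criterion(bag_logits, clip_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, per-segment logits indicate which segments triggered each tag,
# which is what enables localization from weak labels.
```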