2020
DOI: 10.1007/978-3-030-58580-8_26

Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing

Cited by 107 publications (116 citation statements)
Citation types: 2 supporting, 110 mentioning, 0 contrasting
References 58 publications
“…The community has attracted an increasing amount of interest in recent years since synchronized audio-visual scenes are widely available in videos. In addition to localizing sound sources, a wide range of tasks have been proposed, including audio-visual sound separation [7,9,26,34,35], audio-visual action recognition [10,17,19,30], audio-visual event localization [27,33], audio-visual video captioning [23,28,32], embodied audio-visual navigation [4,8], audio-visual sound recognition [5], and audio-visual video parsing [29]. Our framework demonstrates that temporal learning facilitates better audio-visual understanding, which explicitly and subsequently benefits the localization performance.…”
Section: Audio-visual Video Understanding
Citation type: mentioning, confidence: 99%
“…Cross-modal learning is explored to understand the natural synchronisation between visuals and the audio [3,5,39]. Audio-visual data is leveraged for audio-visual speech recognition [12,28,59,62], audio-visual event localization [51,52,55], sound source localization [4,29,45,49,51,60], self-supervised representation learning [25,31,35,37,39], generating sounds from video [10,19,38,64], and audio-visual source separation for speech [1,2,13,16,18,37], music [20,22,56,60,61], and objects [22,24,53]. In contrast to all these methods, we perform a different task: to produce binaural two-channel audio from a monaural audio clip using a video's visual stream.…”
Section: Introduction
Citation type: mentioning, confidence: 99%
“…In the other seconds, the food frying can both be heard and seen, thus we label these as frying. To make this task more generalizable, Tian et al. [36] expand the task of localizing one event to multi-event scenarios and introduce the audio-visual video parsing task, illustrated in Fig. 1(b): given a video that includes several audible, visible, and audio-visible events, the task aims to predict all event categories, distinguish the modalities perceiving each event, and localize their temporal boundaries.…”
Section: Introduction
Citation type: mentioning, confidence: 99%
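
The excerpt above summarizes the parsing task defined in the cited paper. As a rough illustration only (code from neither paper), the sketch below shows one plausible way to represent the per-segment, per-modality labels such a parser would output; the class names and data structure are assumptions made for illustration.

```python
# Hypothetical label structure for audio-visual video parsing:
# for each 1-second segment, multi-label events per perceiving modality.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SegmentParse:
    second: int
    audio_events: List[str] = field(default_factory=list)         # heard only
    visual_events: List[str] = field(default_factory=list)        # seen only
    audio_visual_events: List[str] = field(default_factory=list)  # heard and seen

def frying_example() -> List[SegmentParse]:
    # A 3-second clip where frying is heard throughout but only seen from t=1s,
    # mirroring the frying example in the excerpt above.
    return [
        SegmentParse(second=0, audio_events=["frying"]),
        SegmentParse(second=1, audio_visual_events=["frying"]),
        SegmentParse(second=2, audio_visual_events=["frying"]),
    ]

if __name__ == "__main__":
    for seg in frying_example():
        print(seg)
```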
“…For audio-visual event localization, prior works [24,32,37,43-45,52] explore the relationship between auditory and visual sequences via different kinds of attention mechanisms. For audio-visual video parsing, Tian et al. [36] propose a hybrid attention network to capture the temporal context of the whole video sequence, which tends to focus more on holistic content and is capable of detecting the major event throughout the video. However, these methods are limited in some cases, such as when the target events are short or when a video includes several events of various lengths.…”
Section: Introduction
Citation type: mentioning, confidence: 99%
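
The excerpt mentions attention mechanisms over the temporal sequence. The snippet below is a minimal, generic temporal self-attention layer in PyTorch, shown only to illustrate the idea of letting each one-second snippet attend to the whole video sequence; it is an assumed sketch, not the hybrid attention network of Tian et al. [36], and the feature dimension and head count are illustrative.

```python
# Minimal temporal self-attention sketch (illustrative; not the cited method).
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, dim) per-second snippet features.
        ctx, _ = self.attn(x, x, x)   # each snippet attends to the whole sequence
        return self.norm(x + ctx)     # residual keeps local, snippet-level detail

if __name__ == "__main__":
    feats = torch.randn(2, 10, 512)              # 2 videos, 10 one-second snippets
    print(TemporalSelfAttention()(feats).shape)  # torch.Size([2, 10, 512])
```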