Cross-modal learning has been explored to exploit the natural synchronisation between visual and audio signals [3,5,39]. Audio-visual data has been leveraged for audio-visual speech recognition [12,28,59,62], audio-visual event localization [51,52,55], sound source localization [4,29,45,49,51,60], self-supervised representation learning [25,31,35,37,39], generating sounds from video [10,19,38,64], and audio-visual source separation for speech [1,2,13,16,18,37], music [20,22,56,60,61], and objects [22,24,53]. In contrast to all these methods, we address a different task: producing binaural two-channel audio from a monaural audio clip using the visual stream of a video.