2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00041

2.5D Visual Sound

Abstract: Binaural audio provides a listener with 3D sound sensation, allowing a rich perceptual experience of the scene. However, binaural recordings are scarcely available and require nontrivial expertise and equipment to obtain. We propose to convert common monaural audio into binaural audio by leveraging video. The key idea is that visual frames reveal significant spatial cues that, while explicitly lacking in the accompanying single-channel audio, are strongly linked to it. Our multi-modal approach recovers this li…
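A minimal PyTorch-style sketch of the idea described in the abstract: predict the left/right difference channel of the binaural signal from the mono mixture plus a visual embedding. The module names (VisualEncoder, Mono2Binaural), layer choices, and tensor shapes below are illustrative assumptions, not the paper's actual architecture.

```python
# Sketch only: mono = L + R, and the network predicts diff = L - R,
# so the two channels can be recovered as (mono +/- diff) / 2.
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained image CNN producing a frame embedding."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, frames):          # frames: (B, 3, H, W)
        return self.net(frames)         # (B, feat_dim)

class Mono2Binaural(nn.Module):
    """Predicts the (L - R) difference spectrogram from the mono mixture and visual features."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.audio_enc = nn.Conv2d(2, 64, kernel_size=3, padding=1)   # real/imag of mono STFT
        self.fuse = nn.Conv2d(64 + feat_dim, 64, kernel_size=1)
        self.mask_head = nn.Conv2d(64, 2, kernel_size=3, padding=1)

    def forward(self, mono_spec, vis_feat):
        # mono_spec: (B, 2, F, T) real/imag STFT of the mono mixture (L + R)
        a = torch.relu(self.audio_enc(mono_spec))
        v = vis_feat[:, :, None, None].expand(-1, -1, a.shape[2], a.shape[3])
        x = torch.relu(self.fuse(torch.cat([a, v], dim=1)))
        mask = self.mask_head(x)        # elementwise mask (a simplification of a complex mask)
        return mask * mono_spec         # predicted (L - R) difference spectrogram

if __name__ == "__main__":
    frames = torch.randn(1, 3, 224, 224)       # one video frame
    mono_spec = torch.randn(1, 2, 257, 64)     # STFT of the mono mixture
    diff_spec = Mono2Binaural()(mono_spec, VisualEncoder()(frames))
    left = (mono_spec + diff_spec) / 2
    right = (mono_spec - diff_spec) / 2
```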

Cited by 173 publications (258 citation statements) · References 46 publications

“…Audio-Visual Source Separation Early methods for audio-visual source separation focus on mutual information [10], subspace analysis [42,34], matrix factorization [33,39], and correlated onsets [5,27]. Recent methods leverage deep learning for separating speech [8,31,3,11], musical instruments [52,13,51], and other objects [12]. Similar to the audio-only methods, almost all use a "mix-and-separate" training paradigm to perform video-level separation by artificially mixing training videos.…”
Section: Related Work
confidence: 99%
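The "mix-and-separate" paradigm mentioned in this excerpt can be illustrated with a short, hypothetical training-step sketch: audio from two unrelated training videos is artificially mixed, and the network must recover each original track conditioned on its own video. The Separator module and the magnitude-spectrogram mixing below are simplifying assumptions for illustration, not any cited paper's exact method.

```python
# Sketch only: in practice waveforms are mixed and the STFT recomputed;
# summing magnitude spectrograms here is a simplification.
import torch
import torch.nn as nn

class Separator(nn.Module):
    """Hypothetical separator: predicts a frequency mask for one source from its video feature."""
    def __init__(self, n_freq=257, feat_dim=128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, n_freq)

    def forward(self, mix_spec_mag, vis_feat):
        # mix_spec_mag: (B, F, T) magnitude spectrogram of the artificial mixture
        # vis_feat:     (B, feat_dim) visual embedding of one of the two videos
        mask = torch.sigmoid(self.proj(vis_feat))[:, :, None]   # (B, F, 1)
        return mask * mix_spec_mag                               # separated estimate

def mix_and_separate_step(model, spec_a, spec_b, vis_a, vis_b, optimizer):
    """One training step: mix spectrograms from videos A and B, then separate both back."""
    mix = spec_a + spec_b                       # artificial "video-level" mixture
    est_a = model(mix, vis_a)
    est_b = model(mix, vis_b)
    loss = nn.functional.l1_loss(est_a, spec_a) + nn.functional.l1_loss(est_b, spec_b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = Separator()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    spec_a, spec_b = torch.rand(4, 257, 64), torch.rand(4, 257, 64)
    vis_a, vis_b = torch.randn(4, 128), torch.randn(4, 128)
    print(mix_and_separate_step(model, spec_a, spec_b, vis_a, vis_b, opt))
```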
“…Generating Sounds from Video Sound generation methods synthesize a sound track from a visual input [32,54,6]. Given both visual input and monaural audio, recent methods generate spatial (binaural or ambisonic) audio [13,30]. Unlike any of the above, our work aims to separate an existing real audio track, not synthesize plausible new sounds.…”
Section: Related Work
confidence: 99%
“…These may imply that, by the definition of learnability [12], the task is not a fully learnable problem only with unsupervised data in our setting, which is static-image based single-channel audio source localization, but can be fixed with even a small amount of relevant prior knowledge. Although the sound localization task is not effectively addressed with our unsupervised learning approach with static images and mono audios, other methods that use spatial microphones [25], [53], [54], [55] or temporal information, motion [8] and synchronization [18], with multiple frames have been shown to perform well on this task with unsupervised algorithms. In the following, we conclude our work with additional discussion for future investigation.…”
Section: Discussion
confidence: 99%