2020
DOI: 10.1109/tip.2020.3009820

Generating Visually Aligned Sound From Videos

Abstract: We focus on the task of generating sound from natural videos, where the sound should be both temporally and content-wise aligned with the visual signals. This task is extremely challenging because some sounds generated outside the camera's view cannot be inferred from the video content, and the model may be forced to learn an incorrect mapping between visual content and these irrelevant sounds. To address this challenge, we propose a framework named REGNET. In this framework, we first extract appearance and motion features from vid…
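As context for the framework sketched in the abstract, the snippet below illustrates one plausible way to obtain per-frame appearance features and a crude motion cue from a video clip; the ResNet backbone and the frame-difference motion proxy are assumptions for illustration, not the encoders described in the REGNET paper.

```python
# Illustrative sketch only: a torchvision ResNet backbone for appearance
# features and frame differences as a crude motion cue. These are assumed
# choices, not the encoders actually used by REGNET.
import torch
import torchvision.models as models

def extract_appearance_and_motion(frames: torch.Tensor):
    """frames: (T, 3, 224, 224) tensor of preprocessed RGB video frames."""
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()      # expose the 512-d pooled features
    backbone.eval()

    with torch.no_grad():
        appearance = backbone(frames)      # (T, 512) per-frame appearance
        diffs = frames[1:] - frames[:-1]   # (T-1, 3, 224, 224) frame differences
        motion = backbone(diffs)           # (T-1, 512) crude motion features
    return appearance, motion
```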

Cited by 66 publications (45 citation statements); references 31 publications.

“…Another similar work [27] generated music for a given video, and [28] generated visually aligned sounds from videos. In sound2sight [29], future video frames and motion dynamics are generated by conditioning on audio and a few past frames.…”
Section: Related Work (mentioning)
confidence: 99%
“…Chen et al [5] exploited conditional generative adversarial networks to generate cross-modal audio-visuals of musical performances. Chen et al [6] designed an audio forwarding regularizer that controls the irrelevant sound component, thereby preventing the model from learning an incorrect mapping between the video frames and the sound emitted by off-screen objects. Akbari et al [3] tried to reconstruct natural-sounding speech using a neural network that takes as input the face region of the talker and estimates bottleneck features extracted from the auditory spectrogram by a pre-trained autoencoder.…”
Section: A. Visually Aligned Sound Synthesis (mentioning)
confidence: 99%
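Read literally from the description above, the audio forwarding regularizer passes the ground-truth audio through a narrow bottleneck during training so the visual branch is not forced to account for off-screen sound, and that input is dropped at inference time. The following is only a minimal sketch of that idea; the module names, layer sizes, and zero-filling at inference are assumptions, not the published implementation.

```python
import torch
import torch.nn as nn

class AudioForwardingRegularizer(nn.Module):
    """Illustrative sketch: a bottleneck encoder over ground-truth audio
    features whose output is concatenated with visual features during
    training and replaced by zeros at inference (all sizes are assumptions)."""

    def __init__(self, audio_dim=80, visual_dim=512, bottleneck_dim=16):
        super().__init__()
        self.audio_encoder = nn.Sequential(
            nn.Linear(audio_dim, bottleneck_dim),  # narrow bottleneck limits
            nn.ReLU(),                             # how much audio can leak through
        )
        self.bottleneck_dim = bottleneck_dim

    def forward(self, visual_feats, audio_feats=None):
        # visual_feats: (B, T, visual_dim); audio_feats: (B, T, audio_dim) or None
        if audio_feats is not None:                # training: forward real audio
            fwd = self.audio_encoder(audio_feats)
        else:                                      # inference: no audio available
            B, T, _ = visual_feats.shape
            fwd = visual_feats.new_zeros(B, T, self.bottleneck_dim)
        return torch.cat([visual_feats, fwd], dim=-1)
```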
“…For example, the well-known MoCo [33], SimCLR [34], and BYOL [35] perform basic data augmentation operations on unlabeled images to obtain anchors, positive examples, and negative examples, and then minimize the distances between positive pairs while maximizing those between negative pairs. In addition, some studies also design various pretext tasks such as image colorization [36], jigsaw puzzles [37], and image inpainting [38] for unlabeled image data, and vehicle tracking [39], relative speed perception [40], background erasure [41], and sound generation [42] for unlabeled audio and video data. Based on models trained on pretext tasks, self-supervised learning can further transfer the learned features to downstream tasks.…”
Section: A. Recognition From Web Data (mentioning)
confidence: 99%
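The augment-and-contrast objective attributed to MoCo and SimCLR above is commonly instantiated as an InfoNCE loss over two augmented views of the same batch; a minimal sketch follows (the temperature value and the use of in-batch negatives are illustrative assumptions, not taken from the cited papers).

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """z1, z2: (B, D) embeddings of two augmented views of the same images.
    Each row of z1 is pulled toward its counterpart in z2 and pushed away
    from every other row (in-batch negatives)."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature               # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)          # diagonal entries are positives
```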