ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9052918

Spectrogram Analysis Via Self-Attention for Realizing Cross-Model Visual-Audio Generation

Cited by 4 publications (6 citation statements, all classified as "mentioning"); references 12 publications. The citing publications appeared between 2021 and 2023.
“…We mark every evaluation metric with an arrow, where an up-arrow (↑) indicates that a larger value is better and, similarly, a down-arrow (↓) indicates that a lower value is better. In [30], audio and images are generated using a conditional GAN, while we generate video and audio jointly, which is more challenging. They also use the class label while training the model.…”
Section: Discussion (mentioning)
confidence: 99%
“…In sound2sight [29], future video frames and motion dynamics are generated by conditioning on audio and a few past frames. In SA-CMGAN [30], a self-attention mechanism is applied to cross-modal visual-audio generation.…”
Section: Related Work (mentioning)
confidence: 99%
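
The excerpt above credits SA-CMGAN [30] with applying self-attention to cross-modal visual-audio generation. A minimal PyTorch sketch of a SAGAN-style self-attention layer over a 2-D spectrogram feature map follows; the class name, the channel-reduction factor of 8, and the zero-initialized mixing gate are conventional assumptions, not details taken from the SA-CMGAN paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolutions project the feature map into query/key/value spaces.
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        # Zero-initialized gate: training starts from the plain convolutional
        # features and gradually mixes in the attended features.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w  # number of time-frequency positions
        q = self.query(x).view(b, -1, n).permute(0, 2, 1)  # (b, n, c//8)
        k = self.key(x).view(b, -1, n)                     # (b, c//8, n)
        attn = F.softmax(torch.bmm(q, k), dim=-1)          # (b, n, n)
        v = self.value(x).view(b, c, n)                    # (b, c, n)
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)
        return self.gamma * out + x  # residual connection

# Example: attention over a hypothetical 128-channel spectrogram feature map.
feats = torch.randn(4, 128, 32, 32)
print(SelfAttention2d(128)(feats).shape)  # torch.Size([4, 128, 32, 32])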
“…It also achieves an end-to-end output with a Video-to-Audio (V-A) model. In [9], an SA-CMGAN model was introduced, which uses two networks to mutually generate images and audio. Each network includes an encoder, generator, and discriminator.…”
Section: Basic Methods and Principles Mainly Involved by Foley (mentioning)
confidence: 99%
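
A schematic sketch of the two-network layout described in the excerpt: each direction (image-to-audio and audio-to-image) chains an encoder and a generator, and a discriminator judges the generated modality. The module internals and dimensions below are linear stand-ins chosen only to make the sketch runnable, not the actual SA-CMGAN components.

import torch
import torch.nn as nn

class CrossModalBranch(nn.Module):
    def __init__(self, encoder: nn.Module, generator: nn.Module):
        super().__init__()
        self.encoder = encoder      # source modality -> latent code
        self.generator = generator  # latent code -> target modality

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.generator(self.encoder(x))

# Linear stand-ins for the real encoder/generator/discriminator networks.
image_to_audio = CrossModalBranch(nn.Linear(64, 16), nn.Linear(16, 128))
audio_to_image = CrossModalBranch(nn.Linear(128, 16), nn.Linear(16, 64))
audio_disc = nn.Sequential(nn.Linear(128, 1), nn.Sigmoid())  # judges generated audio

fake_audio = image_to_audio(torch.randn(4, 64))
print(audio_disc(fake_audio).shape)  # torch.Size([4, 1])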
“…Chen et al. [9] focused on the generation of an image from the audio and vice versa for single-instrument performance videos from the URMP dataset [39] using two Generative Adversarial Nets (GAN) [21], while Hao et al. [24] improved the performance of the GAN with cross-modal cycle-consistency [82]. Furthermore, Tan et al. [62] incorporated self-attention [68] into the GAN architecture and Su et al. [60] proposed to generate a piano sound by vocoding Midi predicted from a video. Recently, Kurmi et al. [36] introduced the generation of short (1s) musical videos.…”
Section: Related Work (mentioning)
confidence: 99%
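
The excerpt above notes that Hao et al. [24] improved the GAN with cross-modal cycle-consistency [82]. The idea, sketched below under the assumption of an L1 reconstruction penalty and a conventional weight of 10 (neither confirmed by the cited paper), is that mapping image-to-audio-to-image (and audio-to-image-to-audio) should approximately reconstruct the input.

import torch
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(image, audio, g_img2aud, g_aud2img, weight=10.0):
    # image -> generated audio -> reconstructed image
    image_cycle = l1(g_aud2img(g_img2aud(image)), image)
    # audio -> generated image -> reconstructed audio
    audio_cycle = l1(g_img2aud(g_aud2img(audio)), audio)
    return weight * (image_cycle + audio_cycle)

# Toy check with linear stand-ins for the two cross-modal generators.
g_i2a, g_a2i = nn.Linear(64, 128), nn.Linear(128, 64)
loss = cycle_consistency_loss(torch.randn(4, 64), torch.randn(4, 128), g_i2a, g_a2i)
print(loss.item())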
“…Previous works have proposed models to controllably generate, e.g., images [13,17,38,45,48,51,55,57,73,76,77], videos [6,12,25,37,42,46,64,65,71], and audio [1,9,15,22,24,47,62,63], or to separate sounds [18,19,79,80,84]. However, most of the audio works are music-related, and only a few attempts have been made to generate visually guided audio in an open-domain setup [11,83].…”
(mentioning)
confidence: 99%