ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413802

Collaborative Learning to Generate Audio-Video Jointly

Abstract: A number of techniques have demonstrated the generation of multimedia data for one modality at a time using GANs, such as images, videos, or audio. However, the multi-modal generation of data, specifically of audio and video together, has not been sufficiently well explored. Towards this, we propose a method that generates naturalistic samples of video and audio data through the joint, correlated generation of the two modalities…
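To make the joint-generation idea concrete, the following is a minimal sketch of one common design, not necessarily the authors' architecture: a shared latent code drives an audio generator and a video generator, while a single discriminator scores the (audio, video) pair so that its gradients couple the two streams. All module names, layer sizes, and the PyTorch setup are illustrative assumptions.

    import torch
    import torch.nn as nn

    class JointAVGenerator(nn.Module):
        """Illustrative sketch: one latent code z drives both modalities."""
        def __init__(self, z_dim=128, a_dim=1024, v_dim=4096):
            super().__init__()
            self.shared = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU())
            self.audio_head = nn.Linear(256, a_dim)  # e.g. a flattened spectrogram chunk
            self.video_head = nn.Linear(256, v_dim)  # e.g. a flattened frame block

        def forward(self, z):
            h = self.shared(z)  # the common factor is what correlates audio and video
            return self.audio_head(h), self.video_head(h)

    class JointDiscriminator(nn.Module):
        """Scores an (audio, video) pair jointly, enforcing cross-modal consistency."""
        def __init__(self, a_dim=1024, v_dim=4096):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(a_dim + v_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 1))

        def forward(self, audio, video):
            return self.net(torch.cat([audio, video], dim=-1))

    # One generator step with the non-saturating GAN loss on a toy batch:
    G, D = JointAVGenerator(), JointDiscriminator()
    bce = nn.BCEWithLogitsLoss()
    fake_a, fake_v = G(torch.randn(8, 128))
    g_loss = bce(D(fake_a, fake_v), torch.ones(8, 1))  # generator wants D to answer "real"
    g_loss.backward()

Because the discriminator only ever sees audio and video as a pair, a mismatched pair is penalized even when each modality looks realistic on its own; that is the sense in which the generation is "joint" and "correlated".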

Cited by 6 publications (4 citation statements) | References 21 publications
“…On the one hand, the possibility of artificially generating additional data from the few existing samples using a Generative Adversarial Network, in order to subsequently train the deep learning models, is discussed [45], [46]. On the other hand, transfer learning is presented, i.e. the possibility of "further training" an already trained deep learning model with only a little data and then using the resulting model for this purpose.…”
Section: B. Gesture Recognition With Deep Learning (mentioning)
confidence: 99%
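Both data-scarcity strategies quoted above are standard practice. As a minimal sketch of the second, transfer learning, assuming a torchvision ResNet backbone and a hypothetical five-class gesture task: freeze the pretrained weights and retrain only a small head on the few available samples.

    import torch
    import torch.nn as nn
    from torchvision import models

    # Illustrative transfer learning: reuse pretrained features, retrain only the head.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for p in model.parameters():
        p.requires_grad = False                    # freeze the already-trained layers
    model.fc = nn.Linear(model.fc.in_features, 5)  # new head; 5 is a hypothetical class count

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)  # only the new head trains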
“…Further assessment revealed that the model could learn features to recognize actions with minimal supervision: scene dynamics are viable for representation learning. Several works were proposed for the same purpose using GANs [404,406,407]. The Motion and Content decomposed GAN (MoCoGAN) was introduced by Tulyakov et al. [408]. The translation of input to output images can be performed using a CGAN, a recurring theme in computer vision, computer graphics, and image processing. The pix2pix model resolves these image-related issues [415][416][417].…”
Section: Video Prediction and Generation (mentioning)
confidence: 99%
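For reference, the pix2pix objective mentioned above pairs a conditional adversarial term with an L1 reconstruction term. A toy sketch follows; the stand-in convolutional modules and tensor sizes are assumptions, since the real model uses a U-Net generator and a PatchGAN discriminator.

    import torch
    import torch.nn as nn

    # Stand-in modules; real pix2pix uses a U-Net generator and a PatchGAN discriminator.
    G = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # maps input image x to a fake output
    D = nn.Conv2d(6, 1, kernel_size=3, padding=1)   # scores the concatenated (x, y) pair

    bce, l1, LAMBDA = nn.BCEWithLogitsLoss(), nn.L1Loss(), 100.0  # lambda=100 as in pix2pix

    x, y = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)   # toy input/target pair
    fake = G(x)
    logits = D(torch.cat([x, fake], dim=1))         # conditioning D on the input image
    g_loss = bce(logits, torch.ones_like(logits)) + LAMBDA * l1(fake, y)
    g_loss.backward()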
“…The model's efficacy was verified empirically via quantitative and qualitative approaches. This approach has been improved in different ways [360,404,407].

5. Anime character generation. Apart from requiring experts for routine tasks, animation production and game development are costly.…”
(mentioning)
confidence: 99%
“…Several studies showcase the success of the adversarial learning framework in a variety of applications such as image generation [11], [15], audio generation [20], domain adaptation [9], [23], [25], image inpainting [32], [53], incremental learning [24], and fairness learning. All of these approaches optimize the network with an adversarial discriminator.…”
Section: Adversarial Learning (mentioning)
confidence: 99%
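The shared pattern named in that last sentence is easy to see in code. Below is a minimal, assumed example of one such adversarial discriminator, the gradient-reversal setup commonly used for domain adaptation; the layer sizes and toy data are illustrative, not taken from any of the cited works.

    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        """Identity in the forward pass; reversed, scaled gradient in the backward pass."""
        @staticmethod
        def forward(ctx, x, lamb):
            ctx.lamb = lamb
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_out):
            return -ctx.lamb * grad_out, None

    features = nn.Linear(32, 16)    # shared feature extractor (toy sizes)
    domain_clf = nn.Linear(16, 2)   # adversarial discriminator: source vs. target domain

    h = features(torch.randn(8, 32))
    domain_logits = domain_clf(GradReverse.apply(h, 1.0))
    # Minimizing this loss trains the discriminator, while the reversed gradient
    # pushes the feature extractor toward domain-invariant representations.
    loss = nn.CrossEntropyLoss()(domain_logits, torch.randint(0, 2, (8,)))
    loss.backward()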