Everybody Dance Now

Chan, Caroline; Ginosar, Shiry; Zhou, Tao; Efros, Alexei A.

doi:10.1109/iccv.2019.00603

Cited by 673 publications (618 citation statements)

References 51 publications

Supporting

Mentioning

616

Contrasting

Unclassified


“…The first is in using gestures as a representation for video analysis: co-speech hand and arm motion make a natural target for video prediction tasks. The second is using in-the-wild gestures as a way of training conversational agents: we presented one way of visualizing gesture predictions, based on GANs [10], but, following classic work [8], these predictions could also be used to drive the motions of virtual agents. Finally, our method is one of only a handful of initial attempts to predict motion from audio.…”

Section: Results (mentioning)

confidence: 99%

“…Since our work studies personalized gestures for in-the-wild videos, where 3D data is not available, we use a data-driven synthesis approach inspired by Bregler et al [2]. To do this, we employ the pose-to-video method of Chan et al [10], which uses a conditional generative adversarial network (GAN) to synthesize videos of human bodies from pose.…”

Section: Conversational Agents (mentioning)

confidence: 99%

“…This can be seen in our supplementary video results. To combat the issue and ensure that we produce realistic motion, we add an adversarial discriminator [22,10] D, conditioned on the difference of the predicted sequence of poses. i.e.…”

Section: Predicting Plausible Motion (mentioning)

confidence: 99%
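
The statement above mentions a discriminator conditioned on the difference of the predicted pose sequence. As a rough illustration only, the frame-to-frame differences such a discriminator might receive could be computed as sketched below. The function name `pose_differences` and the flat list-of-coordinates pose layout are assumptions made for this sketch; they are not taken from the citing paper.

```python
from typing import List

# Hypothetical layout: one pose is a flat list of keypoint coordinates.
Pose = List[float]


def pose_differences(sequence: List[Pose]) -> List[Pose]:
    """Return frame-to-frame differences of a pose sequence.

    For T input frames this yields T - 1 difference vectors, i.e. a
    simple motion signal that a discriminator could be conditioned on.
    """
    if len(sequence) < 2:
        return []
    dim = len(sequence[0])
    if any(len(pose) != dim for pose in sequence):
        raise ValueError("all poses must have the same number of coordinates")
    return [
        [curr - prev for prev, curr in zip(sequence[t - 1], sequence[t])]
        for t in range(1, len(sequence))
    ]
```

Working on differences rather than raw poses exposes the motion between frames directly, which is presumably why the quoted passage conditions the discriminator this way.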


Learning Individual Styles of Conversational Gesture

Ginosar, Bar, Kohavi, et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Self Cite

Figure 1: Speech-to-gesture translation example. In this paper, we study the connection between conversational gesture and speech. Here, we show the result of our model that predicts gesture from audio. From the bottom upward: the input audio, arm and hand pose predicted by our model, and video frames synthesized from pose predictions using [10].

Abstract: Human speech is often accompanied by hand and arm gestures. Given audio speech input, we generate plausible gestures to go along with the sound. Specifically, we perform cross-modal translation from "in-the-wild" monologue speech of a single speaker to their hand and arm motion. We train on unlabeled videos for which we only have noisy pseudo ground truth from an automatic pose detection system. Our proposed model significantly outperforms baseline methods in a quantitative comparison. To support research toward obtaining a computational understanding of the relationship between gesture and speech, we release a large video dataset of person-specific gestures.


“…Sun et al [41] propose a two-stage framework to perform head inpainting conditioned on the generated facial landmark in the first stage. Chan et al [5] propose a method to transfer motion between human subjects based on pose stick figures in different videos. Yan et al [54] propose a method to generate human motion sequence with a simple background using CGAN and human skeleton information.…”

Section: Related Work (mentioning)

confidence: 99%

Cycle In Cycle Generative Adversarial Networks for Keypoint-Guided Image Generation

Tang, Liu, et al. 2019

Proceedings of the 27th ACM International Conference on Multimedia

In this work, we propose a novel Cycle In Cycle Generative Adversarial Network (C2GAN) for the task of keypoint-guided image generation. The proposed C2GAN is a cross-modal framework exploring a joint exploitation of the keypoint and the image data in an interactive manner. C2GAN contains two different types of generators, i.e., keypoint-oriented generator and image-oriented generator. Both of them are mutually connected in an end-to-end learnable fashion and explicitly form three cycled sub-networks, i.e., one image generation cycle and two keypoint generation cycles. Each cycle not only aims at reconstructing the input domain, and also produces useful output involving in the generation of another cycle. By so doing, the cycles constrain each other implicitly, which provides complementary information from the two different modalities and brings extra supervision across cycles, thus facilitating more robust optimization of the whole network. Extensive experimental results on two publicly available datasets, i.e., Radboud Faces [19] and Market-1501 [58], demonstrate that our approach is effective to generate more photo-realistic images compared with state-of-the-art models.


“…In Chan et al [15] temporal information is introduced in a cGAN architecture by adding extra conditions on their generator and discriminator. The current input and previously generated image are conditions for their generator.…”

Section: B Network Architecture (mentioning)

confidence: 99%
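
The statement above says the generator is conditioned on the current input and the previously generated image. One way to picture that conditioning is an autoregressive rollout, sketched below under stated assumptions: `rollout`, `toy_generator`, the flattened-list `Frame` type, and the placeholder `initial_frame` are all hypothetical names for illustration, and the averaging "generator" merely stands in for a trained network.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for an image: a flat list of pixel values.
Frame = List[float]


def rollout(
    generator: Callable[[Frame, Frame], Frame],
    inputs: Sequence[Frame],
    initial_frame: Frame,
) -> List[Frame]:
    """Generate one frame per input, feeding each output back as a condition."""
    outputs: List[Frame] = []
    previous = initial_frame  # placeholder condition for the first step
    for current in inputs:
        # The generator sees both the current input and the last output.
        generated = generator(current, previous)
        outputs.append(generated)
        previous = generated
    return outputs


def toy_generator(current: Frame, previous: Frame) -> Frame:
    """Toy stand-in for a trained network: average the two conditions."""
    return [(c + p) / 2 for c, p in zip(current, previous)]


frames = rollout(toy_generator, [[2.0], [4.0]], initial_frame=[0.0])
```

Because every output depends on the one before it, abrupt changes between consecutive frames are damped in this toy version; a learned generator would have to acquire comparable behaviour through training.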

Automatic Temporally Coherent Video Colorization

Thasarathan, Nazeri, Ebrahimi 2019

2019 16th Conference on Computer and Robot Vision (CRV)

Greyscale image colorization for applications in image restoration has seen significant improvements in recent years. Many of these techniques that use learning-based methods struggle to effectively colorize sparse inputs. With the consistent growth of the anime industry, the ability to colorize sparse input such as line art can reduce significant cost and redundant work for production studios by eliminating the in-between frame colorization process. Simply using existing methods yields inconsistent colors between related frames resulting in a flicker effect in the final video. In order to successfully automate key areas of large-scale anime production, the colorization of line arts must be temporally consistent between frames. This paper proposes a method to colorize line art frames in an adversarial setting, to create temporally coherent video of large anime by improving existing image to image translation methods. We show that by adding an extra condition to the generator and discriminator, we can effectively create temporally consistent video sequences from anime line arts.
