TACR-Net: Editing on Deep Video and Voice Portraits

Song, Luchuan; Liu, Bin; Yin, Guojun; Dong, Xiaoyi; Zhang, Yufei; Bai, Jia-Xuan

doi:10.1145/3474085.3475196

Cited by 17 publications

(6 citation statements)

References 35 publications

“…Comparative study is done between the various deep learning algorithms such as TAC-NETS [43], RESNET-34 [44], MULTI-CNN…”

Section: Comparative Analysis and Discussion (mentioning)

confidence: 99%

FED-AT-VIDEO NETS - A Federated Capsule – Self Gated Learning Architecture for the Multi-View Video Summarization Technique.

KANDASWAMY, BALACHANDER 2023

Preprint

Video analytics over the large volume of data from surveillance networks has become a core function for applications such as object detection, human activity recognition, and health care diagnosis. Because of the sheer volume of this data, efficient video summarization is a vital challenge in constructing a video analytics architecture. Moreover, these videos contain private information, and securing them against intruders adds to the existing challenges. In recent years, a number of architectures have been proposed to achieve better and more secure multi-view summarization (MVS) in support of video analytics. However, existing architectures need further research to overcome these challenges. In this article, a federated deep gated attention architecture (FDGAA) is proposed to attain secure MVS by organizing the computing and networking resources of the cloud and edge cameras collectively. The proposed architecture is modeled as a three-tier framework: 1) a video collection unit (VDU) that collects videos from the different views of the installed cameras; 2) a distributed training network (DTN) consisting of federated-learning self-attention saliency gated recurrent units (SAS-GRU), in which training is shared collaboratively among the edges without sacrificing the privacy of the video information; and 3) a cloud tier in which the extracted deep features are summarized for further processing. Using a variety of datasets and NVIDIA Nano Boards as edge nodes, substantial experimentation is conducted to develop the federated learning architecture, which is based on the Google Federated TensorFlow libraries. Performance is compared with other deep-learning-based MVS systems, and the experimental evaluation shows that the proposed model performs better than these state-of-the-art approaches.

“…As touched upon a bit earlier, audio-driven deep fakes can be categorised by whether they are generated by leveraging an audio driven structural representation of the face, or without. There have been numerous approaches over the years relating to the former, ranging from ones such as [2,7,10,16,19,31,39,56,66,68,74,75,79,91] which generate a set of 2D facial landmark co-ordinates from audio, or [8,15,32,37,52,62,63,69,76,77,83,84,85,87] which predict expression parameters from audio to drive a 3D face model. What these approaches all have in common is that they use these intermediate structural representations as input to a separate neural rendering model which is typically trained as an image to image translation task to generate the final photo realistic image frame.…”

Section: Audio Driven Video Generation (mentioning)

confidence: 99%

Speech Driven Video Editing via an Audio-Conditioned Diffusion Model

Bigioi, Basak, Jordan et al. 2023

Preprint

In this paper we propose a method for end-to-end speech driven video editing using a denoising diffusion model. Given a video of a person speaking, we aim to re-synchronise the lip and jaw motion of the person in response to a separate auditory speech recording without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model with audio spectral features to generate synchronised facial motion. We achieve convincing results on the task of unstructured single-speaker video editing, achieving a word error rate of 45% using an off-the-shelf lip reading model. We further demonstrate how our approach can be extended to the multi-speaker domain. To our knowledge, this is the first work to explore the feasibility of applying denoising diffusion models to the task of audio-driven video editing.

“…• Talking-Head Video Synthesis. In talking-head video synthesis, some pipelines [74,38,85,46,67,47,72] for high-quality face synthesis usually extracts the 3D face parameters from the target face images through 3D face models [6,73,24], and generates the 3D face parameters from source speech or text, and then generates the face images from the generated 3D face parameters. • Image/Video/Sound Generation.…”

Section: Applications Of Regeneration Learning (mentioning)

confidence: 99%

“…• The source data X and target data Y have too much uncorrelated information (i.e., X ∩Y X ∪Y ), such as lyric/video and melody in conditional melody generation [39,81,15,92,19], speech and face images in talking-head video synthesis [74,38,85,46,67,47]. Directly learning the mapping between X and Y would lead to overfitting.…”

Section: Applications Of Regeneration Learning (mentioning)

confidence: 99%

Regeneration Learning: A Learning Paradigm for Data Generation

Xu, Qin, Bian et al. 2023

Preprint

Machine learning methods for conditional data generation usually build a mapping from source conditional data X to target data Y. The target Y (e.g., text, speech, music, image, video) is usually high-dimensional and complex, and contains information that does not exist in source data, which hinders effective and efficient learning on the source-target mapping. In this paper, we present a learning paradigm called regeneration learning for data generation, which first generates Y′ (an abstraction/representation of Y) from X and then generates Y from Y′. During training, Y′ is obtained from Y through either handcrafted rules or self-supervised learning and is used to learn X → Y′ and Y′ → Y. Regeneration learning extends the concept of representation learning to data generation tasks, and can be regarded as a counterpart of traditional representation learning, since 1) regeneration learning handles the abstraction (Y′) of the target data Y for data generation while traditional representation learning handles the abstraction (X′) of source data X for data understanding; 2) both the processes of Y′ → Y in regeneration learning and X → X′ in representation learning can be learned in a self-supervised way (e.g., pre-training); 3) both the mappings from X to Y′ in regeneration learning and from X′ to Y in representation learning are simpler than the direct mapping from X to Y. We show that regeneration learning can be a widely-used paradigm for data generation (e.g., text generation, speech recognition, speech synthesis, music composition, image generation, and video generation) and can provide valuable insights into developing data generation methods.
