2022
DOI: 10.1007/978-3-031-19784-0_41

Text2LIVE: Text-Driven Layered Image and Video Editing

Abstract: Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image and fast adaptation to new tasks still remain an open challenge, currently addressed mostly by costly and lengthy retraining and fine-tuning, or by ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-…

Citation Types: 0 supporting, 58 mentioning, 0 contrasting

Cited by 136 publications (58 citation statements)
References: 57 publications

“…Social interaction ever more takes place in mixed reality in the dawn of the metaverse (Zhang et al., 2023; Filipova, 2023). AI is transforming all scales of cognitive and social identities and interaction into life as we don't know it (Bar-Tal et al., 2022; Gilson et al., 2022; Chen, Hu, Saharia and Cohen, 2022; Cahan and Treutlein, 2023; King and chatGPT, 2023). This section argues that the ways in which one navigates AI-permeated environments can be understood as a multiscale bio-cultural form of Augmented Cognition (AugCog).…”
Section: Preprint - Please Cite the Original
Citation type: mentioning
Confidence: 99%
“…Text-conditioned generation: The field of text-to-image generation has made significant progress in recent years, mainly using CLIP as a representation extractor. Many works use CLIP to optimize a latent vector in the representation space of a pretrained GAN [10,17,30,37], others utilize CLIP to provide classifier guidance for a pretrained diffusion model [3], and [5] employ CLIP to optimize a Deep Image Prior model [52] that correctly edits an image. Recently, the field has shifted from employing CLIP as a loss network for optimization to using it as a backbone in huge generative models [41,45], yielding impressive photorealistic results.…”
Section: Related Work
Citation type: mentioning
Confidence: 99%
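
The statement above describes CLIP being used as a loss network to steer a pretrained generator. Below is a minimal, hedged sketch of that idea, optimizing a GAN latent against a text prompt under a CLIP similarity loss. It assumes PyTorch and OpenAI's `clip` package; the small `Generator` class is only a toy stand-in for a pretrained GAN such as StyleGAN, and the prompt, latent size, and optimization schedule are illustrative assumptions rather than the setup of any specific cited work.

```python
# Hedged sketch: optimize a GAN latent so the generated image matches a text
# prompt under CLIP ("CLIP as a loss network"). The Generator below is a toy
# stand-in for a real pretrained GAN, used only so the sketch runs end to end.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float().eval()          # keep everything in fp32
for p in clip_model.parameters():
    p.requires_grad_(False)                     # CLIP stays frozen

class Generator(nn.Module):
    """Toy latent-to-image mapping; a real pipeline would load a pretrained GAN."""
    def __init__(self, latent_dim=512, size=64):
        super().__init__()
        self.size = size
        self.fc = nn.Linear(latent_dim, 3 * size * size)

    def forward(self, z):
        return torch.sigmoid(self.fc(z)).view(-1, 3, self.size, self.size)

def clip_loss(images, prompt):
    """1 - cosine similarity between CLIP embeddings of image and prompt.
    (CLIP's input normalization is omitted here for brevity.)"""
    images = F.interpolate(images, size=224, mode="bilinear", align_corners=False)
    img_emb = F.normalize(clip_model.encode_image(images), dim=-1)
    txt_emb = F.normalize(
        clip_model.encode_text(clip.tokenize([prompt]).to(device)), dim=-1)
    return 1.0 - (img_emb * txt_emb).sum(dim=-1).mean()

generator = Generator().to(device)
latent = torch.randn(1, 512, device=device, requires_grad=True)
optimizer = torch.optim.Adam([latent], lr=0.05)  # only the latent is updated

for step in range(100):
    optimizer.zero_grad()
    loss = clip_loss(generator(latent), "a watercolor painting of a fox")
    loss.backward()
    optimizer.step()
```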
“…Since the advent of CLIP [39], training large vision-language models (VLMs) has become a prominent paradigm for representation learning in computer vision. By observing huge corpora of paired images and captions crawled from the Web, these models learn a powerful and rich joint image-text embedding space, which has been employed in numerous visual tasks, including classification [60,61], segmentation [28,57], motion generation [49], image captioning [32,50], text-to-image generation [10,30,34,42,46] and image or video editing [3,5,7,17,24,37,54]. Recently, VLMs have also been a key component in text-to-image generative models [4,40,42,45], which rely on their textual representations to encapsulate the rich semantic meaning of the input text prompt.…”
Section: Introduction
Citation type: mentioning
Confidence: 99%
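
As a concrete illustration of how that joint image-text embedding is reused downstream, here is a hedged sketch of zero-shot classification with a pretrained CLIP model, following the pattern of OpenAI's published CLIP examples; the image path and class prompts are placeholders, not values from any cited paper.

```python
# Hedged sketch: zero-shot classification with CLIP's joint image-text embedding,
# one of the downstream uses listed in the passage above.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path
prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    img_emb /= img_emb.norm(dim=-1, keepdim=True)
    txt_emb /= txt_emb.norm(dim=-1, keepdim=True)
    # Scaled cosine similarities, turned into a distribution over the prompts.
    probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1)

for prompt, p in zip(prompts, probs[0].tolist()):
    print(f"{prompt}: {p:.3f}")
```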
“…Patashnik et al. [44] adopt the CLIP model for semantic alignment between text and image, and propose mapping text prompts to input-agnostic directions in StyleGAN's style space, achieving interactive text-driven image manipulation. Text2LIVE [45] introduces an edit layer that is composited with the input image, so that the generated edit preserves the original content. The edit layer is directly predicted by a U-Net model.…”
Section: Related Work
Citation type: mentioning
Confidence: 99%
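
A minimal sketch of the layered-editing idea the quote describes: a network predicts an RGBA edit layer that is alpha-composited over the source image, so content outside the edit stays intact. The tiny convolutional net below is only a placeholder for the U-Net generator that Text2LIVE actually uses, and the input shapes are illustrative assumptions.

```python
# Hedged sketch: predict an RGBA "edit layer" and alpha-composite it over the
# source image, the layered-editing idea described in the quote above. The small
# conv net is a placeholder for the real U-Net generator.
import torch
import torch.nn as nn

class EditLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 4, 3, padding=1),  # 3 edit-colour channels + 1 alpha
        )

    def forward(self, image):
        out = self.net(image)
        rgb = torch.sigmoid(out[:, :3])      # edit colours in [0, 1]
        alpha = torch.sigmoid(out[:, 3:4])   # per-pixel opacity in [0, 1]
        return rgb, alpha

def composite(image, rgb, alpha):
    """Standard alpha compositing of the edit layer over the input image."""
    return alpha * rgb + (1.0 - alpha) * image

model = EditLayerNet()
image = torch.rand(1, 3, 256, 256)           # dummy source image
rgb, alpha = model(image)
edited = composite(image, rgb, alpha)        # same shape as the input image
```

Because the output is blended through the alpha matte, regions where alpha is near zero reproduce the original pixels exactly, which is how this formulation preserves the source content outside the edited region.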