Proceedings of the 30th ACM International Conference on Multimedia 2022
DOI: 10.1145/3503161.3548282
Draw Your Art Dream: Diverse Digital Art Synthesis with Multimodal Guided Diffusion

Abstract: Figure 1: Digital art paintings generated by the proposed multimodal guided artwork diffusion (MGAD) model.

Cited by 21 publications (7 citation statements) · References 56 publications
“…2(b), these methods still face the challenge that the styles of the generated images can be inconsistent with the textual prompts. Our research closely follows the previous work [1,4,27,28], focusing on converting multimodal prompts into realistic artistic images and achieving innovations in reconstructing and editing existing images.…”
Section: Related Work (mentioning; confidence: 78%)
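The "multimodal prompts" this statement refers to pair a text prompt with a reference style image. A common way to picture this, and the general shape of CLIP-guided approaches like MGAD, is a single guidance loss in CLIP embedding space that pulls the generated image toward both modalities at once. Below is a minimal sketch of such a combined loss, assuming the open_clip package; the helper multimodal_guidance_loss, the cosine-distance loss form, and the weights w_text/w_image are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Public OpenAI ViT-B/32 weights loaded through open_clip.
clip_model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
clip_model = clip_model.to(device).eval()
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# CLIP's standard input-normalization constants.
_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

def _embed_image(x):
    """Differentiably embed a (1, 3, H, W) image in [0, 1] into CLIP space."""
    x = F.interpolate(x, size=(224, 224), mode="bicubic", align_corners=False)
    x = (x - _MEAN.to(x.device)) / _STD.to(x.device)
    return F.normalize(clip_model.encode_image(x), dim=-1)

def multimodal_guidance_loss(image, text_prompt, style_image, w_text=1.0, w_image=0.5):
    """Cosine-distance loss pulling `image` toward both prompt modalities.

    Its gradient w.r.t. `image` can be injected at each reverse-diffusion
    step, which is the basic mechanism of CLIP-guided synthesis.
    """
    img_emb = _embed_image(image)
    txt_tokens = tokenizer([text_prompt]).to(image.device)
    txt_emb = F.normalize(clip_model.encode_text(txt_tokens), dim=-1)
    sty_emb = _embed_image(style_image)

    text_loss = 1.0 - (img_emb * txt_emb).sum()   # align with the text prompt
    style_loss = 1.0 - (img_emb * sty_emb).sum()  # align with the style image
    return w_text * text_loss + w_image * style_loss
```

The relative weights control how strongly each modality steers the sampler; how the actual methods balance them varies by paper.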
“…These models can generate images that are closely aligned with the input text prompt. Motivated by these successes, many works attempt to utilize pre-trained T2I diffusion models for various tasks such as text-driven image editing [2,13,16,28,29,41]. However, using 2D diffusion models to achieve fine-grained text-driven 3D stylization is seldom explored and remains an open problem in the multimedia and vision fields.…”
Section: Related Work (mentioning; confidence: 99%)
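As a concrete illustration of the text-driven image editing these citing works describe, the sketch below runs a pre-trained T2I diffusion model over an existing image via the diffusers library (an SDEdit-style img2img pass). The checkpoint name, file names, and parameter values are illustrative assumptions, not taken from the cited papers.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Load a publicly available pre-trained T2I diffusion checkpoint.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.png").convert("RGB").resize((512, 512))

# `strength` controls how far the edit departs from the original image:
# low values preserve content, high values follow the prompt more closely.
edited = pipe(
    prompt="an oil painting in the style of Van Gogh",
    image=init_image,
    strength=0.6,
    guidance_scale=7.5,
).images[0]
edited.save("edited.png")
```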
“…One of the large-scale multimodal pre-training models that has found wide application is the Contrastive Language-Image Pretraining (CLIP) [38] model, which was pre-trained on 400 million text-image pairs. In parallel, various new image synthesis methods [15,18,25,27,37,39] have highlighted the richness of the vast visual and linguistic realm encompassed by CLIP. Nonetheless, manipulating existing objects in arbitrary real images remains challenging.…”
Section: Introduction (mentioning; confidence: 99%)
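To make CLIP's shared text-image embedding space concrete, here is a minimal sketch that scores one image against several candidate captions using the public OpenAI checkpoint through the transformers library; the file name and prompts are illustrative.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("artwork.png").convert("RGB")
texts = ["a watercolor landscape", "a photo of a cat", "an abstract oil painting"]

# Encode the image and all captions into the shared embedding space.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Contrastive pre-training on 400M pairs makes these logits meaningful
# similarity scores between the image and each caption.
probs = outputs.logits_per_image.softmax(dim=-1)
for t, p in zip(texts, probs[0].tolist()):
    print(f"{p:.3f}  {t}")
```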