Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3478561
A Picture is Worth a Thousand Words

Abstract: A creative image-and-text generative AI system mimics humans' extraordinary abilities to provide users with diverse and comprehensive caption suggestions, as well as rich image creations. In this work, we demonstrate such an AI creation system to produce both diverse captions and rich images. When users imagine an image and associate it with multiple captions, our system paints a rich image to reflect all captions faithfully. Likewise, when users upload an image, our system depicts it with multiple diverse captions.
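
The abstract describes two complementary generation directions. The sketch below illustrates that interface only; the class and method names are hypothetical placeholders, not the authors' actual API.

```python
# Toy sketch of the two generation directions described in the abstract.
# CreativeAssistant, paint, and describe are hypothetical names for illustration.
from typing import Any, List


class CreativeAssistant:
    """Illustrative stand-in for the image-and-text generative system."""

    def paint(self, captions: List[str]) -> Any:
        """Captions -> image: produce one rich image meant to reflect all captions."""
        ...

    def describe(self, image: Any, k: int = 5) -> List[str]:
        """Image -> captions: return k diverse captions for an uploaded image."""
        ...


# Intended usage (same system, both directions):
#   image = assistant.paint(["a red barn at dusk", "geese flying over a pond"])
#   captions = assistant.describe(image, k=3)
```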

Cited by 5 publications (1 citation statement) · References 17 publications
“…Kim et al. (Kim et al. 2022) proposed L-Verse, a feature-augmented variational autoencoder paired with bidirectional auto-regressive transformers for bidirectional image-text generation. Huang et al. (Huang et al. 2021a) exploited a transformer to synthesize high-quality images conditioned on multiple captions. Esser et al. proposed ImageBART, which synthesizes images in a coarse-to-fine manner using autoregressive models and a multinomial diffusion process.…”
Section: Transformer-based Text-to-Image Synthesis
Citation type: mentioning · Confidence: 99%
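
The cited works all generate images as sequences of discrete tokens conditioned on caption text. The following minimal sketch illustrates that general technique (a transformer decoder producing VQ image tokens conditioned on multiple concatenated captions); it is a toy under assumed sizes and names, not any of the cited papers' implementations.

```python
# Minimal sketch of transformer-based text-to-image synthesis conditioned on
# multiple captions. All dimensions, vocabularies, and the sampling loop are
# illustrative assumptions, not the cited methods themselves.
import torch
import torch.nn as nn

VOCAB_TEXT = 1000    # assumed text-token vocabulary size
VOCAB_IMAGE = 512    # assumed VQ codebook size for image tokens
D_MODEL = 256
IMAGE_TOKENS = 64    # e.g. an 8x8 grid of discrete image tokens


class MultiCaptionToImage(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(VOCAB_TEXT, D_MODEL)
        self.image_emb = nn.Embedding(VOCAB_IMAGE + 1, D_MODEL)  # +1 for a BOS token
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB_IMAGE)

    @torch.no_grad()
    def generate(self, captions: list) -> torch.Tensor:
        # Concatenate all caption token sequences into one conditioning memory,
        # so the sampled image tokens can reflect every caption.
        memory = self.text_emb(torch.cat(captions)).unsqueeze(0)   # (1, T, D)
        tokens = torch.full((1, 1), VOCAB_IMAGE)                   # start with BOS id
        for _ in range(IMAGE_TOKENS):
            h = self.decoder(self.image_emb(tokens), memory)       # (1, t, D)
            logits = self.head(h[:, -1])                           # next-token logits
            nxt = torch.distributions.Categorical(logits=logits).sample()
            tokens = torch.cat([tokens, nxt.unsqueeze(0)], dim=1)
        return tokens[:, 1:]  # discrete image tokens; a VQ decoder would map them to pixels


# Example usage with two random captions (token ids only, for illustration):
#   model = MultiCaptionToImage()
#   caps = [torch.randint(0, VOCAB_TEXT, (7,)), torch.randint(0, VOCAB_TEXT, (5,))]
#   image_tokens = model.generate(caps)
```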