2018 25th IEEE International Conference on Image Processing (ICIP)
DOI: 10.1109/icip.2018.8451656

Semantically Invariant Text-to-Image Generation

Abstract: Image captioning has demonstrated models that are capable of generating plausible text given input images or videos. Further, recent work in image generation has shown significant improvements in image quality when text is used as a prior. Our work ties these concepts together by creating an architecture that can enable bidirectional generation of images and text. We call this network Multi-Modal Vector Representation (MMVR). Along with MMVR, we propose two improvements to the text-conditioned image generation…


Cited by 10 publications (10 citation statements)
References 23 publications

“…As a result, many approaches now use attention mechanisms to attend to specific words of the sentence [7], use intermediate representations such as scene layouts [2], condition on additional information such as object bounding boxes [3], or perform interactive image refinement [12]. Other approaches generate images directly from semantic layouts without additional textual input [13], [14], or perform a translation from text to images and back [15], [16].…”
Section: Related Work (mentioning, confidence: 99%)
“…captions containing the phrase "hot dog" are evaluated based on the assumption that the image should contain a dog). [15] introduce a detection score that calculates (roughly) whether a pre-trained object detector detects an object in a generated image with high certainty. However, no information from the caption is taken into account, meaning any detection with high confidence is "good" even if the detected object does not make sense in the context of the caption.…”
Section: Semantic Object Accuracy (SOA) (mentioning, confidence: 99%)
“…"hot dog" for "dog") in addition to a list of false positive captions. In contrast to [140] which also proposed a detection based evaluation metric, SOA does take the caption into account. Table A.5 shows SOA scores as reported in [108].…”
Section: Image-text Alignment Metricsmentioning
confidence: 99%
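These two statements contrast a plain detection score, where any high-confidence detection counts, with SOA, which only counts detections of classes the caption actually mentions. A minimal sketch of that caption-aware idea follows; it is an illustration, not the published SOA implementation, and `detect_objects` and `caption_to_classes` are hypothetical placeholder callables standing in for a pre-trained object detector and a caption parser.

```python
from typing import Callable, Iterable, List, Tuple

Detection = Tuple[str, float]  # (class label, detector confidence)

def soa_like_score(
    samples: Iterable[Tuple[str, object]],           # (caption, generated image) pairs
    caption_to_classes: Callable[[str], List[str]],  # classes the caption mentions
    detect_objects: Callable[[object], List[Detection]],  # pre-trained detector stub
    conf_threshold: float = 0.5,
) -> float:
    """Fraction of generated images in which at least one object class
    mentioned in the caption is detected above the confidence threshold."""
    hits, total = 0, 0
    for caption, image in samples:
        expected = set(caption_to_classes(caption))
        if not expected:
            continue  # caption mentions no detectable class; skip it
        total += 1
        detected = {label for label, conf in detect_objects(image)
                    if conf >= conf_threshold}
        if expected & detected:  # caption-aware: the detection must match the text
            hits += 1
    return hits / total if total else 0.0
```

The key line is the intersection test: a high-confidence detection only counts as a hit when its class matches one mentioned in the caption, which is exactly the distinction the statements above draw between a plain detection score and SOA.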
“…Instructions can vary from choosing the "best" image without specifying what that means to precise directives, such as rating whether objects are identifiable and/or match the input description. For example, users have been asked to rank images based on the relevance of the text [109], to select the image which best depicts the caption [141, 108], to rate whether any one object is identifiable and how well the image aligns with the given text [140], or to select the more convincing image and the one which is more semantically consistent with the ground truth [113]. While some report average ranks, others report the ratio of being ranked first.…”
Section: User Studies (mentioning, confidence: 99%)
“…Given a scene sketch, Gao et al. [7] implemented a controllable image generation method to meet specific requirements. Also, there are many generative models [18, 9, 28, 25] that take text as input for multi-modal text-to-image generation.…”
Section: 2 (mentioning, confidence: 99%)