2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/cvpr.2018.00896
SemStyle: Learning to Generate Stylised Image Captions Using Unaligned Text

Abstract: Linguistic style is an essential part of written communication, with the power to affect both clarity and attractiveness. With recent advances in vision and language, we can start to tackle the problem of generating image captions that are both visually grounded and appropriately styled. Existing approaches either require styled training captions aligned to images or generate captions with low relevance. We develop a model that learns to generate visually relevant styled captions from a large corpus of styled …

Cited by 120 publications (101 citation statements); references 50 publications (102 reference statements).
“…In [69], the cross-domain problem is addressed with a cycle objective. Similarly, unpaired data can be used to generate stylized descriptions [22,46]. Anderson et al [3] propose a method to complete partial sequence data, e.g.…”
Section: Related Work
confidence: 99%
“…Alexander et al. propose a two-stage style transfer approach for image captioning. They first extract objects and verbs from an image and then generate a stylised caption using an RNN trained on story corpora [6]. However, each such caption is only a single story-like sentence and is independent of other captions; put together, the captions do not constitute a context-coherent story.…”
Section: Related Work
confidence: 99%
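The two-stage setup described in that statement (extract object and verb terms, then let a recurrent language model trained on story text turn them into a sentence) can be illustrated with a small term-to-sentence decoder. The sketch below is a toy PyTorch model; the TermToSentenceDecoder class, its dimensions, and the teacher-forced interface are assumptions made for illustration, not the cited implementation.

```python
# Minimal sketch of the second stage: map a short sequence of extracted
# terms (e.g. "dog run park") to a full styled sentence. Illustrative toy
# only; vocabulary, sizes, and class names are invented for this example.
import torch
import torch.nn as nn


class TermToSentenceDecoder(nn.Module):
    """Encode a term sequence with a GRU, then decode a sentence token by token."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, term_ids, sentence_ids):
        # term_ids:     (batch, n_terms)   -- extracted noun/verb terms
        # sentence_ids: (batch, n_tokens)  -- target sentence, teacher-forced
        _, h = self.encoder(self.embed(term_ids))          # summary of the terms
        dec_out, _ = self.decoder(self.embed(sentence_ids), h)
        return self.out(dec_out)                           # (batch, n_tokens, vocab)


# Toy usage: 3 term tokens in, 6 sentence tokens out, over a 1000-word vocab.
model = TermToSentenceDecoder(vocab_size=1000)
terms = torch.randint(0, 1000, (2, 3))
sentence = torch.randint(0, 1000, (2, 6))
logits = model(terms, sentence)
print(logits.shape)  # torch.Size([2, 6, 1000])
```

Training such a decoder on the logits with a cross-entropy loss against the target sentence is the usual next step; style comes from the corpus the decoder is trained on rather than from the term input.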
“…To train the term prediction model, we extract terms from captions in the COCO dataset [5] as the gold labels for the first stage, which identifies terms in images. The selection of terms is inspired by SemStyle [6]. Each sentence is represented as a combination of several noun (object) and verb (action) terms, which preserves the most crucial information needed to generate stories.…”
Section: Data Preparation
confidence: 99%
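The statement above describes representing each caption by its noun and verb terms. As a rough illustration of how such terms could be extracted, here is a minimal sketch using spaCy; the part-of-speech filter, the lemmatisation, and the extract_terms helper are assumptions for this example, not the exact rules used in the cited work.

```python
# Hedged sketch of the term-extraction step: pull noun (object) and verb
# (action) terms out of a caption to use as gold labels for term prediction.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with a POS tagger


def extract_terms(caption: str) -> list[str]:
    """Return lemmatised noun and verb terms, in the order they appear."""
    doc = nlp(caption)
    return [tok.lemma_.lower() for tok in doc
            if tok.pos_ in {"NOUN", "PROPN", "VERB"} and not tok.is_stop]


print(extract_terms("A man is riding a horse on the beach."))
# e.g. ['man', 'ride', 'horse', 'beach']
```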
“…The availability of large curated datasets such as MS-COCO [41] (100K images), Flickr30K [64] (30K images), or Conceptual Captions [74] (3M images) made it possible to train deep learning models for complex, multi-modal tasks such as natural image captioning (NIC) [81], where the goal is to factually describe the image content. Similarly, several other captioning variants, such as visual question answering [5], visual storytelling [38], and stylized captioning [56], have also been explored. Recently, the PCCD dataset (∼ 4200 images) [11] opened up a new area of research: describing images aesthetically.…”
Section: Introduction
confidence: 99%