2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00425
Unsupervised Image Captioning

Abstract: Deep neural networks have achieved great successes on the image captioning task. However, most of the existing models depend heavily on paired image-sentence datasets, which are very expensive to acquire. In this paper, we make the first attempt to train an image captioning model in an unsupervised manner. Instead of relying on manually labeled image-sentence pairs, our proposed model merely requires an image set, a sentence corpus, and an existing visual concept detector. The sentence corpus is used to teach …
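
The training signal the abstract describes can be made concrete. Below is a minimal, hypothetical PyTorch sketch of one way such a reward could be composed: an adversarial "fluency" score from a sentence discriminator trained on the corpus, plus a reward for mentioning the concepts returned by the visual concept detector. All names (SentenceDiscriminator, caption_reward, id2word) are illustrative placeholders, not the authors' code.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the adversarial critic: scores whether a token
# sequence looks like a real sentence from the corpus.
class SentenceDiscriminator(nn.Module):
    def __init__(self, vocab_size, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        _, h = self.rnn(self.emb(tokens))      # h: (1, batch, hidden)
        return torch.sigmoid(self.out(h.squeeze(0)))  # (batch, 1)

def caption_reward(tokens, detected_concepts, disc, id2word):
    """Fluency reward from the discriminator plus a visual-concept reward:
    the fraction of detected concepts the caption actually mentions."""
    fluency = disc(tokens).squeeze(-1)         # (batch,)
    hits = []
    for seq in tokens:
        words = {id2word[int(t)] for t in seq}
        hits.append(sum(c in words for c in detected_concepts)
                    / max(len(detected_concepts), 1))
    return fluency + torch.tensor(hits)

# Toy usage: one candidate caption "a dog runs"; the detector saw a dog.
id2word = {0: "<pad>", 1: "a", 2: "dog", 3: "runs"}
disc = SentenceDiscriminator(vocab_size=len(id2word))
reward = caption_reward(torch.tensor([[1, 2, 3]]), {"dog"}, disc, id2word)
```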

Cited by 210 publications (224 citation statements)
References 32 publications
“…When language and images come from different sources, some weak supervisory signal is needed to align the manifold of visual concepts to the textual domain. Similar to previous work [18], we use a pre-trained object detector to generate an initial noisy alignment between the text source and visual entities that can be detected in the image.…”
Section: Language Domain (mentioning; confidence: 99%)
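
A concrete picture of such an initial noisy alignment (a hypothetical sketch, not code from [18] or the citing paper): pair each image with any corpus sentence that mentions one of its detected object labels, accepting that many pairs will be wrong.

```python
# Hypothetical sketch of an initial noisy alignment: pair each image with any
# corpus sentence mentioning one of its detected object labels. Names and
# data below are illustrative only.
def noisy_alignment(detections, corpus):
    """detections: {image_id: set of detected labels};
    corpus: list of sentences. Returns pseudo image-sentence pairs."""
    pairs = []
    for image_id, labels in detections.items():
        for sentence in corpus:
            if labels & set(sentence.lower().split()):
                pairs.append((image_id, sentence))
    return pairs

pairs = noisy_alignment(
    {"img1": {"dog", "frisbee"}, "img2": {"pizza"}},
    ["a dog catches a frisbee", "a slice of pizza on a plate"],
)
# -> [('img1', 'a dog catches a frisbee'), ('img2', 'a slice of pizza on a plate')]
```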
“…Most closely related to our work is [18], which does not require any image-sentence pairs. In this case, it is optimal to use a language domain which is rich in visual concepts.…”
Section: Related Work (mentioning; confidence: 99%)
“…We indicate the ablation study by: (A) the usage of the proposed GAN that distinguishes real or fake image-caption pairs, (B) pseudo-labeling, and (C) noise handling by sample re-weighting. We also compare with [Gu et al., 2018] and [Feng et al., 2019], which are trained with unpaired datasets.…”
Section: Implementation Details (mentioning; confidence: 99%)
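
The "noise handling by sample re-weighting" idea can be sketched in a few lines (hypothetical names, not the cited implementation): weight each pseudo-labeled pair's captioning loss by the discriminator's confidence that the pair is real, so implausible pairs contribute less to the gradient.

```python
import torch

def reweighted_loss(per_pair_loss, pair_scores):
    """per_pair_loss: (batch,) captioning loss per pseudo pair;
    pair_scores: (batch,) discriminator probability that the pair is real."""
    weights = pair_scores.detach()                  # treat weights as constants
    weights = weights / weights.sum().clamp(min=1e-8)
    return (weights * per_pair_loss).sum()

# A noisy pair (score 0.2) contributes far less than a confident one (0.9).
loss = reweighted_loss(torch.tensor([2.0, 0.5]), torch.tensor([0.2, 0.9]))
```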
“…We also compare the recent unpaired image captioning methods [Gu et al., 2018; Feng et al., 2019] in Table 1. Both methods are evaluated on the MS COCO test set.…”
Section: As Shown In (mentioning; confidence: 99%)
“…The unsupervised machine translation framework is also applied to various other tasks, e.g. image captioning (Feng et al., 2019), text style transfer (Zhang et al., 2018), speech-to-text translation (Bansal et al., 2017), and clinical text simplification (Weng et al., 2019). The UMT framework makes it possible to apply neural models to tasks where only limited human-labeled data is available.…”
Section: Related Work (mentioning; confidence: 99%)