2017 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2017.64

Show, Adapt and Tell: Adversarial Training of Cross-Domain Image Captioner

Abstract: Impressive image captioning results are achieved in domains with plenty of training image and sentence pairs (e.g., MSCOCO). However, transferring to a target domain with significant domain shifts but no paired training data (referred to as cross-domain image captioning) remains largely unexplored. We propose a novel adversarial training procedure to leverage unpaired data in the target domain. Two critic networks are introduced to guide the captioner, namely domain critic and multi-modal critic. The domain cr…
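The abstract outlines a two-critic adversarial procedure; the sketch below shows one way such a training loop could look. All module architectures, dimensions (e.g., the 2048-d image feature), and the choice of REINFORCE with critic probabilities as sentence reward are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch, assuming PyTorch: a captioner guided by a domain critic
# (sentence looks target-domain?) and a multi-modal critic (valid pair?).
import torch
import torch.nn as nn
import torch.nn.functional as F

V, D, H, T = 1000, 64, 128, 12  # vocab size, embed dim, hidden dim, max length

class Captioner(nn.Module):
    """Image-conditioned GRU that samples a caption token by token."""
    def __init__(self):
        super().__init__()
        self.img2h = nn.Linear(2048, H)   # assumed 2048-d CNN image feature
        self.emb = nn.Embedding(V, D)
        self.cell = nn.GRUCell(D, H)
        self.out = nn.Linear(H, V)

    def sample(self, img):
        h = torch.tanh(self.img2h(img))
        tok = torch.zeros(img.size(0), dtype=torch.long)  # <BOS> = 0
        toks, logps = [], []
        for _ in range(T):
            h = self.cell(self.emb(tok), h)
            dist = torch.distributions.Categorical(logits=self.out(h))
            tok = dist.sample()
            toks.append(tok)
            logps.append(dist.log_prob(tok))
        return torch.stack(toks, 1), torch.stack(logps, 1)

class DomainCritic(nn.Module):
    """Logit for: this sentence comes from the target-domain corpus."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, D)
        self.rnn = nn.GRU(D, H, batch_first=True)
        self.head = nn.Linear(H, 1)

    def forward(self, toks):
        _, h = self.rnn(self.emb(toks))
        return self.head(h[-1]).squeeze(-1)

class MultiModalCritic(nn.Module):
    """Logit for: this (image, sentence) pair is valid."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, D)
        self.rnn = nn.GRU(D, H, batch_first=True)
        self.head = nn.Linear(H + 2048, 1)

    def forward(self, img, toks):
        _, h = self.rnn(self.emb(toks))
        return self.head(torch.cat([h[-1], img], -1)).squeeze(-1)

cap, dc, mc = Captioner(), DomainCritic(), MultiModalCritic()
opt_g = torch.optim.Adam(cap.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(list(dc.parameters()) + list(mc.parameters()), lr=1e-4)
bce = F.binary_cross_entropy_with_logits

def train_step(tgt_img, tgt_sents, src_img, src_sents):
    """tgt_*: unpaired target-domain data; src_*: paired source-domain data."""
    # 1) Critic update: target sentences and source pairs are positives,
    #    the captioner's samples on target images are negatives.
    fake, _ = cap.sample(tgt_img)
    d_loss = (bce(dc(tgt_sents), torch.ones(tgt_sents.size(0)))
              + bce(dc(fake), torch.zeros(fake.size(0)))
              + bce(mc(src_img, src_sents), torch.ones(src_img.size(0)))
              + bce(mc(tgt_img, fake), torch.zeros(fake.size(0))))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Captioner update: REINFORCE, with the critics' probabilities
    #    combined into a per-sentence reward.
    toks, logps = cap.sample(tgt_img)
    reward = torch.sigmoid(dc(toks)) * torch.sigmoid(mc(tgt_img, toks))
    g_loss = -(reward.detach().unsqueeze(1) * logps).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```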

Cited by 136 publications (83 citation statements)
References 28 publications
“…Thus, methods developed on such datasets might not be easily adopted in the wild. Nevertheless, great efforts have been made to extend captioning to out-of-domain data [3,9,69] or different styles beyond mere factual descriptions [22,55]. In this work we explore unsupervised captioning, where image and language sources are independent.…”
Section: Language Domain
Mentioning confidence: 99%
“…To suppress the high variance of Monte-Carlo sampling, Self-Critical Sequence Training (SCST) [39] subtracts a baseline from the return to reduce the variance of the gradient estimate. Rather than obtaining a single reward at the end of sampling, actor-critic based algorithms (e.g., Embedded Reward [38], Actor-Critic [55], Adapt [9], HAL [46]) learn both a policy and a state-value function (the "critic"), which is used for bootstrapping, i.e., updating a state's estimate from subsequent estimates, to reduce variance and accelerate learning [41]. Different from existing work, the proposed CRL algorithm learns a critic from the inner environment, complementing the extrinsic reward from the perspective of agent learning.…”
Section: Related Work 2.1 Sentence-Level Captioning with Reinforcement Learning
Mentioning confidence: 99%
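The excerpt above summarizes the SCST baseline trick; a minimal sketch follows, assuming placeholder interfaces model.sample, model.greedy, and cider_reward (none taken from the cited papers). The update is the REINFORCE-with-baseline gradient -(r(w^s) - r(w_hat)) * grad log p(w^s), where the greedy caption's reward r(w_hat) serves as the baseline.

```python
# Sketch of the SCST update: the reward of the greedy-decoded caption is
# the baseline subtracted from the sampled caption's reward, reducing
# policy-gradient variance without learning a separate value function.
# `model` and `cider_reward` are hypothetical interfaces for illustration.
import torch

def scst_loss(model, img_feat, refs, cider_reward):
    with torch.no_grad():
        baseline = cider_reward(model.greedy(img_feat), refs)  # r(w_hat)
    toks, logps = model.sample(img_feat)                       # w^s ~ p_theta
    reward = cider_reward(toks, refs)                          # r(w^s)
    advantage = (reward - baseline).unsqueeze(1)               # r(w^s) - r(w_hat)
    # REINFORCE with baseline: minimize -(advantage) * log p(w^s)
    return -(advantage.detach() * logps).mean()
```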
“…In recent years, a variety of successive models [2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][18][19][20] have achieved promising results. To generate captions, semantic concepts or attributes of objects in images are detected and utilized as inputs to the RNN decoder [3,6,12,20,22].…”
Section: Deep Image Captioning
Mentioning confidence: 99%
“…Semantic concept analysis, or attribute prediction [17,21], is a task closely related to image captioning, because attributes can be interpreted as a basis for descriptions. To generate captions, semantic concepts or attributes of objects in images are detected and utilized as inputs to the RNN decoder [3,6,12,20,22]. Latent topics [6], cross-domain adaptation [22], and inter-attribute correlations [12] are considered to improve the results.…”
Section: Deep Image Captioning
Mentioning confidence: 99%
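Both excerpts describe detecting attributes and feeding them to the RNN decoder; below is a minimal sketch of one such conditioning scheme, assuming an attribute-probability vector from an external detector. The dimensions and the injection points (initial hidden state plus per-step concatenation) are illustrative assumptions; the cited papers differ in where attributes enter the decoder.

```python
# Sketch of attribute-conditioned decoding: detector outputs condition the
# LSTM decoder's initial state and are concatenated to every step's input.
import torch
import torch.nn as nn

class AttributeDecoder(nn.Module):
    def __init__(self, vocab=1000, n_attr=256, d=64, h=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.attr2h = nn.Linear(n_attr, h)      # attributes set initial state
        self.lstm = nn.LSTMCell(d + n_attr, h)  # ...and join each input
        self.out = nn.Linear(h, vocab)

    def forward(self, attr_probs, tokens):
        # attr_probs: (B, n_attr) detector outputs; tokens: (B, T) gold words
        h = torch.tanh(self.attr2h(attr_probs))
        c = torch.zeros_like(h)
        logits = []
        for t in range(tokens.size(1)):
            x = torch.cat([self.emb(tokens[:, t]), attr_probs], dim=-1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, 1)  # (B, T, vocab) for cross-entropy
```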