Show, Adapt and Tell: Adversarial Training of Cross-Domain Image Captioner

Chen, Tseng-Hung; Liao, Yuan-Hong; Chuang, Ching-Yao; Hsu, Wan‐Ting; Fu, Jianlong; Sun, Min

doi:10.1109/iccv.2017.64

Cited by 136 publications

(83 citation statements)

References 28 publications

“…Thus, methods developed on such datasets might not be easily adopted in the wild. Nevertheless, great efforts have been made to extend captioning to out-of-domain data [3,9,69] or different styles beyond mere factual descriptions [22,55]. In this work we explore unsupervised captioning, where image and language sources are independent.…”

Section: Language Domain (mentioning)

confidence: 99%

Towards Unsupervised Image Captioning With Shared Multimodal Embeddings

Laina

Rupprecht

Navab

2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Understanding images without explicit supervision has become an important problem in computer vision. In this paper, we address image captioning by generating language descriptions of scenes without learning from annotated pairs of images and their captions. The core component of our approach is a shared latent space that is structured by visual concepts. In this space, the two modalities should be indistinguishable. A language model is first trained to encode sentences into semantically structured embeddings. Image features that are translated into this embedding space can be decoded into descriptions through the same language model, similarly to sentence embeddings. This translation is learned from weakly paired images and text using a loss robust to noisy assignments and a conditional adversarial component. Our approach allows exploiting large text corpora outside the annotated distributions of image/caption data. Our experiments show that the proposed domain alignment learns a semantically meaningful representation which outperforms previous work.

“…To suppress high variance of Monte-Carlo sampling, Self-critical Sequential Training (SCST) [39] utilizes a baseline subtracted from the return which is added to reduce the variance of gradient estimation. Rather than obtaining a single reward at the end of sampling, actor-critic based algorithms (e.g., Embedded Reward [38], Actor-Critic [55], Adapt [9], HAL [46]) learn both a policy and a state-value function ("critic"), which is used for bootstrapping, i.e., updating a state from subsequent estimation, to reduce variance and accelerate learning [41]. Different from existing work, the proposed CRL algorithm learns about a critic from the inner environment, complementing the extrinsic reward from the perspective of agent learning.…”

Section: Related Work 21 Sentence-level Captioning With Reinforcemen (mentioning)

confidence: 99%

Curiosity-driven Reinforcement Learning for Diverse Visual Paragraph Generation

Luo

Huang

Zhang

et al. 2019

Proceedings of the 27th ACM International Conference on Multimedia

Visual paragraph generation aims to automatically describe a given image from different perspectives and organize sentences in a coherent way. In this paper, we address three critical challenges for this task in a reinforcement learning setting: the mode collapse, the delayed feedback, and the time-consuming warm-up for policy networks. Generally, we propose a novel Curiosity-driven Reinforcement Learning (CRL) framework to jointly enhance the diversity and accuracy of the generated paragraphs. First, by modeling the paragraph captioning as a long-term decision-making process and measuring the prediction uncertainty of state transitions as intrinsic rewards, the model is incentivized to memorize precise but rarely spotted descriptions to context, rather than being biased towards frequent fragments and generic patterns. Second, since the extrinsic reward from evaluation is only available until the complete paragraph is generated, we estimate its expected value at each time step with temporal-difference learning, by considering the correlations between successive actions. Then the estimated extrinsic rewards are complemented by dense intrinsic rewards produced from the derived curiosity module, in order to encourage the policy to fully explore action space and find a global optimum. Third, discounted imitation learning is integrated for learning from human demonstrations, without separately performing the time-consuming warm-up in advance. Extensive experiments conducted on the Stanford image-paragraph dataset demonstrate the effectiveness and efficiency of the proposed method, improving the performance by 38.4% compared with state-of-the-art.

“…In recent years, a variety of successive models [2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][18][19][20] have achieved promising results. To generate captions, semantic concepts or attributes of objects in images are detected and utilized as inputs of the RNN decoder [3,6,12,20,22].…”

Section: Deep Image Captioning (mentioning)

confidence: 99%

“…Semantic concept analysis, or attribute prediction [17,21], is a task closely related to image captioning, because attributes can be interpreted as a basis for descriptions. To generate captions, semantic concepts or attributes of objects in images are detected and utilized as inputs of the RNN decoder [3,6,12,20,22]. Latent topics [6], cross domains [22], and inter-attribute correlations [12] are considered to improve the results.…”

Section: Deep Image Captioning (mentioning)

confidence: 99%

“…To generate captions, semantic concepts or attributes of objects in images are detected and utilized as inputs of the RNN decoder [3,6,12,20,22]. Latent topics [6], cross domains [22], and inter-attribute correlations [12] are considered to improve the results. Meanwhile, some approaches [5,15,[17][18][19] have adopted multimodal embedding, which represents multiple aspects of objects with pictures and descriptions as the latent semantics of objects.…”

Section: Deep Image Captioning (mentioning)

confidence: 99%

Image classification and captioning model considering a CAM‐based disagreement loss

Yoon

Park

et al. 2019

ETRI Journal

Image captioning has received significant interest in recent years, and notable results have been achieved. Most previous approaches have focused on generating visual descriptions from images, whereas a few approaches have exploited visual descriptions for image classification. This study demonstrates that a good performance can be achieved for both description generation and image classification through an end‐to‐end joint learning approach with a loss function, which encourages each task to reach a consensus. When given images and visual descriptions, the proposed model learns a multimodal intermediate embedding, which can represent both the textual and visual characteristics of an object. The performance can be improved for both tasks by sharing the multimodal embedding. Through a novel loss function based on class activation mapping, which localizes the discriminative image region of a model, we achieve a higher score when the captioning and classification model reaches a consensus on the key parts of the object. Using the proposed model, we established a substantially improved performance for each task on the UCSD Birds and Oxford Flowers datasets.
