Unsupervised Image Captioning

Feng, Yang; Ma, Lin; Liu, Wei; Luo, Jiebo

doi:10.1109/cvpr.2019.00425

Cited by 210 publications

(224 citation statements)

References 32 publications

Supporting

Mentioning: 223

Contrasting

Unclassified

“…When language and images come from different sources, some weak supervisory signal is needed to align the manifold of visual concepts to the textual domain. Similar to previous work [18], we use a pre-trained object detector to generate an initial noisy alignment between the text source and visual entities that can be detected in the image.…”

Section: Language Domain (mentioning)

confidence: 99%

Towards Unsupervised Image Captioning With Shared Multimodal Embeddings

Laina

Rupprecht

Navab

2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Understanding images without explicit supervision has become an important problem in computer vision. In this paper, we address image captioning by generating language descriptions of scenes without learning from annotated pairs of images and their captions. The core component of our approach is a shared latent space that is structured by visual concepts. In this space, the two modalities should be indistinguishable. A language model is first trained to encode sentences into semantically structured embeddings. Image features that are translated into this embedding space can be decoded into descriptions through the same language model, similarly to sentence embeddings. This translation is learned from weakly paired images and text using a loss robust to noisy assignments and a conditional adversarial component. Our approach allows to exploit large text corpora outside the annotated distributions of image/caption data. Our experiments show that the proposed domain alignment learns a semantically meaningful representation which outperforms previous work.

“…Most closely related to our work is [18] which does not require any image-sentence pairs. In this case, it is optimal to use a language domain which is rich in visual concepts.…”

Section: Related Work (mentioning)

confidence: 99%

“…We indicate the ablation study by: (A) the usage of the proposed GAN that distinguishes real or fake image-caption pairs, (B) pseudo-labeling, and (C) noise handling by sample re-weighting. We also compare with [Gu et al, 2018] and [Feng et al, 2019], which are trained with unpaired datasets.…”

Section: Implementation Details (mentioning)

confidence: 99%

“…We also compare the recent unpaired image captioning methods [Gu et al, 2018; Feng et al, 2019] in Table 1. Both of the methods are evaluated on MS COCO testset.…”

Section: As Shown In (mentioning)

confidence: 99%

Image Captioning with Very Scarce Supervised Data: Adversarial Semi-Supervised Learning Approach

Kim

Choi

Oh

et al. 2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen

Constructing an organized dataset comprised of a large number of images and several captions for each image is a laborious task, which requires vast human effort. On the other hand, collecting a large number of images and sentences separately may be immensely easier. In this paper, we develop a novel data-efficient semi-supervised framework for training an image captioning model. We leverage massive unpaired image and caption data by learning to associate them. To this end, our proposed semi-supervised learning method assigns pseudo-labels to unpaired samples via Generative Adversarial Networks to learn the joint distribution of image and caption. To evaluate, we construct scarcely-paired COCO dataset, a modified version of MS COCO caption dataset. The empirical results show the effectiveness of our method compared to several strong baselines, especially when the amount of the paired samples are scarce.

“…The unsupervised machine translation framework is also applied to various other tasks, e.g. image captioning (Feng et al, 2019), text style transfer (Zhang et al, 2018), speech to text translation (Bansal et al, 2017) and clinical text simplification (Weng et al, 2019). The UMT framework makes it possible to apply neural models to tasks where limited human labeled data is available.…”

Section: Related Work (mentioning)

confidence: 99%

Generating Classical Chinese Poems from Vernacular Chinese

Yang

Cai

Feng

et al. 2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen

Classical Chinese poetry is a jewel in the treasure house of Chinese culture. Previous poem generation models only allow users to employ keywords to interfere the meaning of generated poems, leaving the dominion of generation to the model. In this paper, we propose a novel task of generating classical Chinese poems from vernacular, which allows users to have more control over the semantic of generated poems. We adapt the approach of unsupervised machine translation (UMT) to our task. We use segmentation-based padding and reinforcement learning to address under-translation and over-translation respectively. According to experiments, our approach significantly improve the perplexity and BLEU compared with typical UMT models. Furthermore, we explored guidelines on how to write the input vernacular to generate better poems. Human evaluation showed our approach can generate high-quality poems which are comparable to amateur poems.
