STAIR Captions: Constructing a Large-Scale Japanese Image Caption Dataset

Yoshikawa, Yuya; Shigeto, Yutaro; Takeuchi, Akikazu

doi:10.18653/v1/p17-2066

Cited by 95 publications

(76 citation statements)

References 16 publications

“…The Japanese corpus we use is based on the newly created STAIR dataset [6]. Using the same methodology as [2], [6] collected 5 Japanese captions for each image of the original MSCOCO dataset. As for the original MSCOCO dataset, Japanese captions were written by native Japanese speakers.…”

Section: English and Japanese Corpora (mentioning)

confidence: 99%

Models of Visually Grounded Speech Signal Pay Attention to Nouns: A Bilingual Experiment on English and Japanese

Havard, Chevrot, Besacier

2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

We investigate the behaviour of attention in neural models of visually grounded speech trained on two languages: English and Japanese. Experimental results show that attention focuses on nouns and this behaviour holds true for two very typologically different languages. We also draw parallels between artificial neural attention and human attention and show that neural attention focuses on word endings as it has been theorised for human attention. Finally, we investigate how two visually grounded monolingual models can be used to perform cross-lingual speech-to-speech retrieval. For both languages, the enriched bilingual (speech-image) corpora with part-of-speech tags and forced alignments are distributed to the community for reproducible research.

Index Terms: grounded language learning, attention mechanism, cross-lingual speech retrieval, recurrent neural networks.

“…For image captioning, we utilize the multi30k (Elliott et al 2016), COCO (Chen et al 2015) and STAIR (Yoshikawa, Shigeto, and Takeuchi 2017) datasets. The multi30k dataset contains 30k images and annotations under two tasks.…”

Section: Datasets (mentioning)

confidence: 99%

Unsupervised Bilingual Lexicon Induction from Mono-Lingual Multimodal Data

Chen, Jin, Hauptmann

2019

AAAI

Bilingual lexicon induction, translating words from the source language to the target language, is a long-standing natural language processing task. Recent endeavors prove that it is promising to employ images as pivot to learn the lexicon induction without reliance on parallel corpora. However, these vision-based approaches simply associate words with entire images, which are constrained to translate concrete words and require object-centered images. We humans can understand words better when they are within a sentence with context. Therefore, in this paper, we propose to utilize images and their associated captions to address the limitations of previous approaches. We propose a multi-lingual caption model trained with different mono-lingual multimodal data to map words in different languages into joint spaces. Two types of word representation are induced from the multi-lingual caption model: linguistic features and localized visual features. The linguistic feature is learned from the sentence contexts with visual semantic constraints, which is beneficial to learn translation for words that are less visual-relevant. The localized visual feature is attended to the region in the image that correlates to the word, so that it alleviates the image restriction for salient visual representation. The two types of features are complementary for word translation. Experimental results on multiple language pairs demonstrate the effectiveness of our proposed method, which substantially outperforms previous vision-based approaches without using any parallel sentences or supervision of seed word pairs.

[Figure caption] (a) The previous vision-based approach: a word is represented by global features extracted from retrieved images. It requires object-centered images and is unreliable for non-concrete words. (b) Our proposed approach: the word representation is learned from both sentence contexts and visual localization.

“…MS-COCO (Lin et al, 2014) contains 123'287 images and five English captions per image. Yoshikawa et al (2017) proposed a model which generates Japanese descriptions for images. We divide the dataset based on .…”

Section: Datasets (mentioning)

confidence: 99%

“…Previous works in image-caption task and learning a joint embedding space for texts and images are mostly related to English language, however, recently there is a large amount of research in other languages due to the availability of multilingual datasets (Funaki and Nakayama, 2015; Rajendran et al, 2015; Miyazaki and Shimizu, 2016; Young et al, 2014; Hitschler and Riezler, 2016; Yoshikawa et al, 2017). The aim of these models is to map images and their captions in a single language into a joint embedding space (Rajendran et al, 2015; Calixto et al, 2017).…”

Section: Introduction (mentioning)

confidence: 99%

Aligning Multilingual Word Embeddings for Cross-Modal Retrieval Task

Mohammadshahi, Lebret, Aberer

2019

Proceedings of the Beyond Vision and LANguage: InTEgrating Real-World kNowledge (LANTERN)

In this paper, we propose a new approach to learn multimodal multilingual embeddings for matching images and their relevant captions in two languages. We combine two existing objective functions to make images and captions close in a joint embedding space while adapting the alignment of word embeddings between existing languages in our model. We show that our approach enables better generalization, achieving state-of-the-art performance in text-to-image and image-to-text retrieval task, and caption-caption similarity task. Two multimodal multilingual datasets are used for evaluation: Multi30k with German and English captions and Microsoft-COCO with English and Japanese captions.
