2014
DOI: 10.1007/978-3-319-10593-2_35
Improving Image-Sentence Embeddings Using Large Weakly Annotated Photo Collections

Cited by 240 publications (168 citation statements)
References 27 publications
“…In addition, all 3,361 image-text pairs from the BBC News Corpora and 2,999 image-text pairs from the SimpleWiki dataset have been included. From this randomly shuffled corpus, samples have been selected to generate a disjoint split of 190,202 training and 6,270 validation samples. The image encoding network has been initialized with weights of a pre-trained InceptionV3 model.…”
Section: Methods (mentioning, confidence: 99%)
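The quoted methods passage describes shuffling a merged corpus and carving out a disjoint 190,202/6,270 train/validation split. A minimal sketch of that step is below; the pair contents and the seed are hypothetical, and only the corpus size (196,472 = 190,202 + 6,270) is taken from the quote.

```python
import random

# Hypothetical merged corpus of (image_path, caption) pairs, sized to match
# the quoted totals: 190,202 training + 6,270 validation = 196,472 samples.
pairs = [(f"img_{i}.jpg", f"caption {i}") for i in range(196_472)]

random.seed(8)                      # arbitrary seed for reproducibility
random.shuffle(pairs)               # randomly shuffle the merged corpus

# Disjoint split: the first 190,202 shuffled samples train, the rest validate.
train, val = pairs[:190_202], pairs[190_202:]
```

Because the split is a simple partition of one shuffled list, the two sets are disjoint by construction.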
“…dataset (Young et al 2014), a popular benchmark for caption generation and retrieval that has been used, among others, by Chen and Zitnick (2015); Donahue et al (2015); Fang et al (2015); Gong et al (2014b); Karpathy et al (2014); Karpathy and Fei-Fei (2015); Kiros et al (2014); Klein et al (2014); Lebret et al (2015); Mao et al (2015); Vinyals et al (2015); Xu et al (2015). Flickr30k contains 31,783 images focusing mainly on people and animals, and 158,915 English captions (five per image).…”
Section: Fig (mentioning, confidence: 99%)
“…This approach learns an embedding of region and phrase features to a shared latent space and uses distance in that space to retrieve image regions given a phrase. While there have been several neural network-based approaches for learning such embeddings (Karpathy and Fei-Fei 2015; Kiros et al 2014; Mao et al 2015), using state-of-the-art text and image features with Canonical Correlation Analysis (CCA) (Hotelling 1936) continues to produce remarkable results (Gong et al 2014b; Klein et al 2014; Lev et al 2016), and is also much faster to train than a neural network. Given two sets of matching features from different views (in our case, image and text features), CCA finds linear projections of both views into a joint space of common dimensionality in which the correlation between the views is maximized.…”
Section: Region-phrase Model (mentioning, confidence: 99%)
“…Our corpus has several unique properties to complement existing corpora. As explored in a very recent work of (Gong et al, 2014), we expect that it is possible to combine crowd-sourced and web-harvested datasets and achieve the best of both worlds.…”
Section: Related Work (mentioning, confidence: 99%)