2013
DOI: 10.1613/jair.3994

Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics

Abstract: The ability to associate images with natural language sentences that describe what is depicted in them is a hallmark of image understanding, and a prerequisite for applications such as sentence-based image search. In analogy to image search, we propose to frame sentence-based image annotation as the task of ranking a given pool of captions. We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which …

Cited by 1,082 publications (903 citation statements)
References 54 publications

Citation statements (ordered by relevance):

“…We show that the combination of the basic BoW text kernel and a high level image kernel based on the probabilities given by visual classifiers outperforms the best combination in [17] in most evaluation metrics (Section 5.1); we demonstrate that visual classifiers trained for synsets included in the evaluation dataset improve the retrieval scores by a factor of two compared to the best method in [17] (Section 5.2); finally, in contrast to lexical similarities computed using text corpora, we propose to use the high level visual information to learn a lexical similarity, and show that the BoW text kernel enriched with such lexical similarity further boosts the performance (Section 6).…”
Section: Introduction
Citation type: mentioning
Confidence: 96%
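
The kernel combination this statement describes can be illustrated with a short sketch. The snippet below is a minimal Python example, not the citing paper's code: the linear kernels, the weight `alpha`, and the random stand-in image features are all assumptions made for illustration.

```python
# Minimal sketch (illustrative assumptions, not the citing paper's setup):
# combine a bag-of-words text kernel with an image kernel.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import linear_kernel

def bow_text_kernel(captions):
    """Gram matrix over bag-of-words caption vectors."""
    bow = CountVectorizer().fit_transform(captions)  # sparse term counts
    return linear_kernel(bow)                        # K[i, j] = <x_i, x_j>

def combined_kernel(k_text, k_image, alpha=0.5):
    """Convex combination of the text and image kernels."""
    return alpha * k_text + (1.0 - alpha) * k_image

captions = ["a dog runs on the beach", "two children play soccer",
            "a man rides a bicycle"]
k_text = bow_text_kernel(captions)

# In the citing paper the image kernel is built from visual classifier
# probabilities; a random positive semi-definite matrix stands in here.
rng = np.random.default_rng(0)
feats = rng.random((3, 8))
k_image = feats @ feats.T

k_combined = combined_kernel(k_text, k_image)
print(k_combined.shape)  # (3, 3)
```

A convex combination with nonnegative weights keeps the result a valid kernel, since positive semi-definiteness is preserved under nonnegative sums.
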
“…In [17] the problem of generating a natural language description for a given image is relaxed to one of ranking a set of human-written captions, by assuming the set contains the original (human-written) caption of the image. [17] builds a dataset (dubbed Flickr8K) of image and caption pairs, and employs kernel canonical correlation analysis (KCCA) [18,19] to learn a latent space in which a similarity measure between an image and a caption is defined. KCCA requires two kernels to be built, one for the images and the other for the captions.…”
Section: Introduction
Citation type: mentioning
Confidence: 99%
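
The latent-space retrieval idea behind KCCA can be sketched with its linear special case. The snippet below uses scikit-learn's CCA as a stand-in for the kernel CCA of [18,19]; the feature matrices, dimensions, and cosine ranking function are placeholder assumptions, not Flickr8K data or the paper's implementation.

```python
# Illustrative sketch of latent-space caption ranking. Linear CCA stands in
# for kernel CCA; all data below are random placeholders.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
img_feats = rng.random((100, 64))   # one row of image features per pair
txt_feats = rng.random((100, 32))   # matching caption features (e.g. BoW)

# Learn projections into a shared latent space from aligned pairs.
cca = CCA(n_components=10).fit(img_feats, txt_feats)
z_img, z_txt = cca.transform(img_feats, txt_feats)

def rank_captions(query, pool):
    """Indices of pool captions sorted by cosine similarity to the query."""
    sims = (pool @ query) / (
        np.linalg.norm(pool, axis=1) * np.linalg.norm(query) + 1e-12)
    return np.argsort(-sims)

order = rank_captions(z_img[0], z_txt)  # best-matching captions for image 0
print(order[:5])
```

In the kernelized version, the projections are defined implicitly through the image and caption Gram matrices rather than through explicit feature maps, which is why the method needs one kernel per modality.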