2014 IEEE Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/cvpr.2014.341

Zero-Shot Event Detection Using Multi-modal Fusion of Weakly Supervised Concepts

Abstract: Current state-of-the-art systems for visual content analysis require large training sets for each class of interest, and performance degrades rapidly with fewer examples. In this paper, we present a general framework for the zero-shot learning problem of performing high-level event detection with no training exemplars, using only textual descriptions. This task goes beyond the traditional zero-shot framework of adapting a given set of classes with training data to unseen classes. We leverage video and image col…

Cited by 103 publications (158 citation statements)
References 28 publications
“…As shown in Table VI, our results have a mean average precision of 28.79%, a significant gain of 6.5% over the state-of-the-art approach [15], which also curates a training set from the internet. We also compare with other recent methods: 20.80% [16], 16.7% [9], 16.1% [7], 11.89% [37], 8.86% [46], 6.39% [10], 6.12% [44], 2.5% [14], 2.3% [3]. We outperform all these methods significantly, due to the following key differences.…”
Section: E. Comparison to the State of the Art (mentioning)
confidence: 90%
“…Improvements to this scheme include modeling a pair of words or n-grams to not only exploit the co-occurrences between words [27], but also disambiguate among the multiple meanings represented by individual words [5]. Works in this paradigm, where events are represented as a collection of concept responses, are increasingly leveraging weakly annotated data from the internet [3], [44]. The method in [46] builds a list of frequent words in the text metadata, which represent concepts, to prune videos downloaded from YouTube.…”
Section: Changing Vehicle Tire (mentioning)
confidence: 99%
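The paradigm described above, where an event is represented as a collection of concept responses, can be sketched as a weighted combination of pre-trained concept-detector outputs. This is a minimal illustrative sketch, not the paper's actual method: the concept names, weights, and detector responses below are assumed values.

```python
# Hypothetical sketch: scoring a video for an event with no training
# exemplars, as a weighted average of concept-detector responses.
# All concept names, weights, and responses are illustrative values.
import numpy as np

# Relevance of each concept to an event such as "changing a vehicle tire",
# e.g. derived from the textual event description (assumed values).
concept_weights = {"car": 0.9, "wheel": 0.8, "person": 0.3, "dog": 0.0}

# Per-video responses from pre-trained concept detectors (assumed values).
video_responses = {"car": 0.7, "wheel": 0.6, "person": 0.9, "dog": 0.1}

def event_score(weights, responses):
    """Weighted average of concept responses; higher means the video
    is more likely to depict the event."""
    w = np.array([weights[c] for c in weights])
    r = np.array([responses[c] for c in weights])
    return float(np.dot(w, r) / (w.sum() + 1e-9))

score = event_score(concept_weights, video_responses)
```

Videos can then be ranked by this score for zero-shot retrieval, with no positive training exemplars of the event itself.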
“…The advantage of word2vec over other semantic embedding methods is that the latent variables are transparent, because the words are represented in vector space with only a few hundred dimensions. Examples of other semantic embedding methods are Wu et al. [2014] with their common lexicon layer, Habibian et al. [2014b] with VideoStory, and Jain et al. [2015] with the embedding of text, actions and objects to classify actions.…”
Section: Concept Selection (mentioning)
confidence: 99%
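The embedding-based concept selection mentioned above amounts to ranking vocabulary words by cosine similarity to a query word in the embedding space. A minimal sketch follows; the 4-dimensional vectors are toy values, not real word2vec embeddings, and the vocabulary is hypothetical.

```python
# Hypothetical sketch: selecting concepts for an event query by cosine
# similarity in a word-embedding space, in the spirit of word2vec-based
# concept selection. The 4-d vectors below are toy values.
import numpy as np

embeddings = {
    "tire":  np.array([0.9, 0.1, 0.0, 0.2]),
    "wheel": np.array([0.8, 0.2, 0.1, 0.1]),
    "cat":   np.array([0.0, 0.9, 0.8, 0.0]),
    "jack":  np.array([0.7, 0.0, 0.1, 0.4]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_concepts(query, vocab, k=2):
    """Return the k vocabulary concepts closest to the query word."""
    sims = {w: cosine(vocab[query], v) for w, v in vocab.items() if w != query}
    return sorted(sims, key=sims.get, reverse=True)[:k]

top = select_concepts("tire", embeddings, k=2)
```

With real embeddings, the nearest neighbours of a query word give a concept bank tailored to the event description without any labeled video data.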