2017
DOI: 10.1109/tpami.2016.2627563

Video2vec Embeddings Recognize Events When Examples Are Scarce

Abstract: This paper aims for event recognition when video examples are scarce or even completely absent. The key in such a challenging setting is a semantic video representation. Rather than building the representation from individual attribute detectors and their annotations, we propose to learn the entire representation from freely available web videos and their descriptions using an embedding between video features and term vectors. In our proposed embedding, which we call Video2vec, the correlations between…
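As a rough illustration of the embedding between video features and term vectors described in the abstract, the sketch below fits a single linear map from synthetic video features to the term vectors of their descriptions and then scores a new video against a textual query. The ridge-regression objective, the dimensionalities, and the cosine scoring are illustrative assumptions, not the paper's exact Video2vec formulation.

```python
import numpy as np

# Hypothetical sketch: learn a linear map W that embeds video features X into
# the term-vector space Y of the videos' descriptions, then score a new video
# against a textual query in that space. All sizes and data are synthetic.
rng = np.random.default_rng(0)
video_dim, term_dim, n_videos = 512, 300, 1000   # assumed feature sizes

X = rng.standard_normal((n_videos, video_dim))   # video features of web videos
Y = rng.standard_normal((n_videos, term_dim))    # term vectors of descriptions

# Closed-form ridge solution for W minimizing ||X W - Y||^2 + lam * ||W||^2
lam = 1.0
W = np.linalg.solve(X.T @ X + lam * np.eye(video_dim), X.T @ Y)

# Embed an unseen video and compare it to a textual event query by cosine
x_new = rng.standard_normal(video_dim)
query = rng.standard_normal(term_dim)
proj = x_new @ W
score = (proj @ query) / (np.linalg.norm(proj) * np.linalg.norm(query))
print(f"cosine similarity to query: {score:.3f}")
```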

Cited by 71 publications (59 citation statements)
References 72 publications (135 reference statements)
“…It is convenient because it is fully automatic. We note that the recent work [58] proposed a different approach for event detection when the number of labeled training examples is limited. A potentially superior approach is to extract relevant keyframes from the positive training exemplars and use these to define saliency.…”
Section: Prioritization Using Semantic Saliency (mentioning)
confidence: 99%
“…At test time, video ranking and retrieval are done using a distance metric between the projected test video y_t and the test query representation y. [16,17] project the visual feature x of a web video v into the term-vector representation y of the video's textual title t. However, during training, the model makes use of the text queries of the test events to learn a better term-vector representation. Consequently, this limits generalization to novel event queries.…”
Section: Related Work (mentioning)
confidence: 99%
“…In the middle, we borrow the network f_T to embed the event article's feature y_t as z_t ∈ Z. Then, at the bottom, the network f_V learns to embed the video feature x as z_v ∈ Z such that the distance between z_v and z_t is minimized in the learned metric space Z. self-paced reranking [23], pseudo-relevance feedback [24], event query manual intervention [25], early fusion of features (action [26,27,28,29,30] or acoustic [31,32,33]), or late fusion of concept scores [17]. All these contributions may be applied to our method.…”
Section: Related Work (mentioning)
confidence: 99%
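To make the retrieval step described in these citation statements concrete, here is a minimal sketch of ranking test videos by their distance to an embedded query in a shared space Z. The linear stand-ins for f_V and f_T, the dimensionalities, and the Euclidean distance are assumptions for illustration only, not the cited method's actual networks.

```python
import numpy as np

# Hypothetical sketch: embed test videos and an event query into a shared
# metric space Z, then rank the videos by distance to the embedded query.
rng = np.random.default_rng(0)
d_video, d_text, d_z = 512, 300, 128               # assumed dimensionalities
A_v = 0.01 * rng.standard_normal((d_z, d_video))   # stand-in for learned f_V
A_t = 0.01 * rng.standard_normal((d_z, d_text))    # stand-in for learned f_T

def f_V(x):
    """Embed a video feature x into the shared space Z."""
    return A_v @ x

def f_T(y):
    """Embed a query / event-article feature y into the shared space Z."""
    return A_t @ y

test_videos = rng.standard_normal((100, d_video))  # toy test collection
query = rng.standard_normal(d_text)                # toy event query feature

z_t = f_T(query)
z_v = np.stack([f_V(x) for x in test_videos])      # shape (100, d_z)
dists = np.linalg.norm(z_v - z_t, axis=1)          # Euclidean distance in Z
ranking = np.argsort(dists)                        # closest videos come first
print(ranking[:10])
```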