2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017
DOI: 10.1109/cvpr.2017.225

Unified Embedding and Metric Learning for Zero-Exemplar Event Detection

Abstract: Event detection in unconstrained videos is conceived as a content-based video retrieval with two modalities: textual and visual. Given a text describing a novel event, the goal is to rank related videos accordingly. This task is zero-exemplar: no video examples are given to the novel event. Related works train a bank of concept detectors on external data sources. These detectors predict confidence scores for test videos, which are ranked and retrieved accordingly. In contrast, we learn a joint space in which th…

Cited by 11 publications (9 citation statements)
References 35 publications
“…Recently, there is a major interest in understanding long-range activities, which brings news challenges. The reason is that these activities are complex [2], take longer to unfold [1] and are harder to model their temporal structure [3,20]. New benchmarks are proposed, as Charades [12], Epic-Kitchens [21], Breakfast [1], MultiThumos [13,22], YouCook [23] or Tasty [24].…”
Section: Related Work (mentioning)
confidence: 99%
“…Often, a pooling layer is very common towards the end of deep CNNs. There exist various pooling mechanisms in literature like Average or Max Pooling [26], [27], Attention Pooling [28], Rank-Pooling [29] and High-Dimensional Feature encoding [30]. The goal of pooling is to select the most important features and reduce the network size so that the model doesn't over-fit.…”
Section: Activity Recognition - Learn-able Poolings (mentioning)
confidence: 99%
“…Traditional Human Activity Recognition: Recent surge of deep learning has significantly influenced the advancement in recognizing human activities from videos. Most attempts in this genre are usually derived from the image-based networks, which are used to extract features from individual frames and extended them to perform temporal integration by forming a fixed size descriptor using statistical pooling such as max and average pooling [16,13], attentional pooling [11], rank pooling [9], context gating [33] and high-dimensional feature encoding [12,55]. However, an important visual cue representing the temporal pattern is overlooked in such statistical pooling and high-dimensional encoding.…”
Section: Related Work and Motivation (mentioning)
confidence: 99%