2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.227
ER3: A Unified Framework for Event Retrieval, Recognition and Recounting

Cited by 26 publications (23 citation statements: 0 supporting, 23 mentioning, 0 contrasting)
References 37 publications
“…Finally, we apply our approach on event retrieval. We compare against the Mean-MultiVLAD (MMV), obtained by averaging and ℓ2-normalizing Multi-VLAD frame descriptors, CTE [28], Stable hyper-pooling [6] and the recent Counting Grid Aggregation (CGA) [11]. LAMV, CTE and TMK are able to provide a good localization in addition to retrieval; the others cannot.…”
Section: Results
confidence: 99%
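The MMV baseline quoted above reduces to a simple aggregation rule: average the frame-level descriptors and ℓ2-normalize the result. A minimal NumPy sketch follows; the frame count and descriptor dimensionality are illustrative assumptions, not values from the cited papers.

import numpy as np

def mean_pool_l2(frame_descriptors: np.ndarray) -> np.ndarray:
    # Average T x D frame descriptors into one D-dim vector, then l2-normalize.
    video_vec = frame_descriptors.mean(axis=0)
    norm = np.linalg.norm(video_vec)
    return video_vec / norm if norm > 0 else video_vec

# Toy usage: 120 frames of 512-D Multi-VLAD-style descriptors (shapes assumed).
frames = np.random.rand(120, 512).astype(np.float32)
v = mean_pool_l2(frames)
assert np.isclose(np.linalg.norm(v), 1.0)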
“…According to a comparative study on copy detection conducted in 2014 [19], the best methods were relying on local descriptors and frame-based matching [18], even though temporal alignment is often needed later, for example to manually verify a copyright infringement. In contrast, the state of the art for particular event retrieval [6,11] exploits a single vector per video.…”
Section: Introduction
confidence: 99%
“…A straightforward approach to this is to aggregate/pool frame-level features into a single video-level representation on which subsequently one can calculate a similarity measure. Such video-level representations include global vectors [35,11,21], hash codes [30,23,31] and Bag-of-Words (BoW) [5,20,22]. However, this disregards the spatial and the temporal structure of the visual similarity, as aggregation of features is influenced by clutter and irrelevant content.…”
Section: Introduction
confidence: 99%
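Once each video is pooled into a single ℓ2-normalized vector, retrieval over such video-level representations is a nearest-neighbor search under cosine similarity. A minimal sketch under that assumption; the database size and dimensionality are illustrative.

import numpy as np

def rank_videos(query: np.ndarray, database: np.ndarray) -> np.ndarray:
    # For l2-normalized vectors, the dot product equals cosine similarity.
    scores = database @ query
    return np.argsort(-scores)  # indices sorted by descending similarity

# Toy usage: 1000 database videos with 512-D video-level vectors (assumed).
db = np.random.rand(1000, 512).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)
print(rank_videos(db[3], db)[:5])  # index 3 should rank first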
“…A simpler alternative is proposed in this paper as the "temporal Detection By spatial Segmentation" (DBS) framework, which circumvents the variable-length input difficulty. As illustrated in Figure 1, temporal action detection is recast into spatial semantic segmentation with the video imprint representation (Gao et al. 2017b; 2018), by aligning video frames into a fixed-size tensor feature. Such a representation captures statistical characteristics while suppressing redundancies.…”
Section: Introduction
confidence: 99%
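The key interface in the quoted DBS framework is that a variable-length frame sequence maps to a fixed-size tensor. The sketch below illustrates only that interface with a simple stand-in (average pooling over a fixed temporal grid); it is not the learned video imprint of Gao et al., and the grid size and feature dimension are assumptions.

import numpy as np

def to_fixed_grid(frame_feats: np.ndarray, grid: int = 32) -> np.ndarray:
    # Stand-in for an imprint-style mapping: pool T x D frame features into
    # a fixed grid x D tensor, whatever the video length T.
    T, _ = frame_feats.shape
    edges = np.linspace(0, T, grid + 1).astype(int)
    cells = [frame_feats[edges[i]:max(edges[i] + 1, edges[i + 1])].mean(axis=0)
             for i in range(grid)]
    return np.stack(cells)

# Videos of different lengths map to tensors of the same shape.
print(to_fixed_grid(np.random.rand(57, 64)).shape)   # (32, 64)
print(to_fixed_grid(np.random.rand(913, 64)).shape)  # (32, 64)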
“…[Figure 1: Overview of the temporal detection by spatial segmentation framework for temporal action detection.] Based on the video imprint (Gao et al. 2017b; 2018), video frames are nonlinearly projected into the video imprint, visualized as the "black cube" at the bottom left. An FCN-based network is utilized to segment the imprint representation, and the corresponding prediction score maps are visualized as a "blue cube" at the bottom center.…”
Section: Introduction
confidence: 99%
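The segmentation step in the excerpt (imprint in, per-cell score maps out) can be illustrated with a small fully convolutional network. A PyTorch sketch follows; the channel counts, class count, and 32x32 imprint shape are assumptions, not the architecture of the cited work.

import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    # Fully convolutional head: imprint tensor in, per-cell class scores out.
    def __init__(self, in_ch: int = 64, n_classes: int = 21):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, n_classes, kernel_size=1),  # 1x1 conv -> score maps
        )

    def forward(self, imprint: torch.Tensor) -> torch.Tensor:
        return self.net(imprint)  # (B, n_classes, H, W)

# Toy usage: one 64-channel 32x32 imprint (shape assumed).
scores = TinyFCN()(torch.randn(1, 64, 32, 32))
print(scores.shape)  # torch.Size([1, 21, 32, 32])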