2018
DOI: 10.1007/978-3-030-04212-7_37
Weakly Supervised Temporal Action Detection with Shot-Based Temporal Pooling Network

Cited by 8 publications (2 citation statements) | References 16 publications
“…In particular, we adopt a block-based processing strategy to obtain a video-level classification score, and propose a novel metric function as a similarity measure between activity portions of the videos. Su et al. [38] proposed shot-based sampling instead of uniform sampling and designed a multi-stage temporal pooling network for action localization. Zeng et al. [49] proposed an iterative training strategy that uses not only the most discriminative action instances but also the less discriminative ones.…”
Section: Related Work
confidence: 99%
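
The snippet above does not spell out how block scores become a video-level score. As a rough illustration only, top-k mean pooling is one common way to aggregate per-block scores under weak supervision; this is a generic sketch, not necessarily the aggregation the cited papers use:

```python
import numpy as np

def video_level_score(block_scores, k=8):
    """Aggregate per-block class scores (num_blocks x num_classes) into a
    single video-level score by averaging the k highest blocks per class."""
    k = min(k, block_scores.shape[0])
    # Sort each class column ascending, reverse to descending, keep top k rows.
    topk = np.sort(block_scores, axis=0)[::-1][:k]
    return topk.mean(axis=0)
```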
“…In particular, we adopt a blockbased processing strategy to obtain a video-level classification score, and propose a novel metric function as a similarity measure between activity portions of the videos. Su et al [38] proposed shot-based sampling instead of uniform sampling and designed a multi-stage temporal pooling network for action localization. Zeng et al [49] proposed an iterative training strategy to use not only the most discriminative action instances but also the less discriminative ones.…”
Section: Related Workmentioning
confidence: 99%
“…Given an input video $X_v$, we form a unit sequence $U$ and extract the corresponding feature sequence $F$ of length $l_u$. Since an untrimmed video usually consists largely of irrelevant frames, with action instances occupying only small parts, we test three sampling methods to simplify the feature sequence and reduce computational cost: (1) uniform sampling: units are extracted at a regular interval $\sigma$ from $U$, so the final unit sequence and feature sequence are $U' = \{u_j\}_{j=1}^{l_{u'}}$ and $F'$ respectively, where $l_{u'} = l_u / \sigma$; (2) sparse sampling: we first divide the video into $P$ segments $\{S_1, S_2, \ldots, S_P\}$ of equal length, then during each training epoch we randomly sample one unit from each segment to form a unit sequence of length $P$; (3) shot-based sampling: considering the action structure, we sample the unit sequence $U$ based on action shots, which are generated by a shot boundary detector [28]. Evaluation results for these sampling methods are shown in Section 4.3.…”
Section: Training of CPMN
confidence: 99%
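
Read literally, the three strategies each reduce to a few lines. The sketch below is an illustrative Python rendering, assuming `units` is a list of unit-level features; the function names and the mid-shot pick in the shot-based variant are my assumptions, not the paper's code:

```python
import random

def uniform_sampling(units, sigma):
    # Keep every sigma-th unit: final length is roughly len(units) / sigma.
    return units[::sigma]

def sparse_sampling(units, num_segments):
    # Split units into num_segments equal-length parts and draw one unit
    # at random from each part (re-drawn every training epoch).
    assert len(units) >= num_segments
    seg_len = len(units) / num_segments
    return [units[random.randrange(int(i * seg_len), int((i + 1) * seg_len))]
            for i in range(num_segments)]

def shot_based_sampling(units, shot_boundaries):
    # One representative unit per detected shot; shot_boundaries are
    # (start, end) index pairs from an external shot boundary detector.
    return [units[(start + end) // 2] for start, end in shot_boundaries]
```

Note the trade-off the quote evaluates: uniform sampling fixes the interval regardless of content, sparse sampling fixes the output length $P$ while adding epoch-to-epoch randomness, and shot-based sampling lets detected shot structure decide where samples fall.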