2020
DOI: 10.1109/tip.2020.3016486

Revisiting Anchor Mechanisms for Temporal Action Localization

Abstract: Most of the current action localization methods follow an anchor-based pipeline: depicting action instances by pre-defined anchors, learning to select the anchors closest to the ground truth, and predicting the confidence of anchors with refinements. Pre-defined anchors set prior about the location and duration for action instances, which facilitates the localization for common action instances but limits the flexibility for tackling action instances with drastic varieties, especially for extremely short or ex…

citations
Cited by 169 publications
(50 citation statements)
references
References 50 publications
“…For action localisation, we follow a two-stage paradigm: class-agnostic action proposal detection and proposal classification. To obtain high-quality action proposals, we first divide the entire video into equal-frame snippets; then use the CLIP image encoder with one Transformer layer to extract frame-wise embeddings for each snippet; and finally feed these embeddings to the off-the-shelf proposal detectors [37,74]. These detectors construct the feature pyramid, and make predictions in parallel, to determine actionness, centerness, and boundaries.…”
Section: Implementation Details (mentioning)
confidence: 99%
“…These detectors construct the feature pyramid, and make predictions in parallel, to determine actionness, centerness, and boundaries. Please refer to [37,74] for detailed detector architectures and optimisations. Note that, our method is flexible to the choice of proposal detectors, and we do not innovate on such candidate proposal procedures.…”
Section: Implementation Details (mentioning)
confidence: 99%
“…The recent success of convolutional neural networks (CNNs) in the video analysis [86] domain is commendable and has replaced statistical image processing, providing automated activity recognition [87,88], data prioritization [89], and many other useful tasks. The most effective methods for video surveillance are based on CNNs or their variants to classify abnormal actions/activities.…”
Section: Applying Fuzzy Logic: Why When and Where? (mentioning)
confidence: 99%
“…While promising results have been obtained, a limitation is that they usually assume the video has been trimmed and aligned with text description. The earlier work for automatically tailoring videos is temporal action localization [14,31], which is to localize interesting actions in a video from a given set of actions. However, the predefined action set is usually small and unrealistic, which is far from actual need.…”
Section: Introduction (mentioning)
confidence: 99%