2022
DOI: 10.1007/978-3-031-19772-7_29

ActionFormer: Localizing Moments of Actions with Transformers

Cited by 210 publications (114 citation statements)
References 65 publications
“…The candidate moments are further examined by convolutional heads shared across pyramid levels for action classification and boundary regression, from which action segments are assembled and combined using multi-class SoftNMS [2]. We refer readers to the main paper [12] for more technical details.…”
Section: Split
mentioning confidence: 99%
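
The multi-class SoftNMS step named in the statement above can be illustrated with a short sketch. This is a minimal single-class Gaussian Soft-NMS over 1-D segments, not the authors' implementation; the function name, the sigma value, and the score threshold are illustrative assumptions, and the multi-class variant simply runs this once per action class.

```python
import numpy as np

def soft_nms_1d(segments, scores, sigma=0.5, score_thresh=1e-3):
    """Gaussian Soft-NMS over 1-D segments (a sketch, not the paper's code).

    segments: (N, 2) array of [start, end] times
    scores:   (N,)   array of confidences
    Returns kept (segment, score) pairs with decayed scores.
    """
    segments, scores = segments.astype(float), scores.astype(float)
    keep = []
    while scores.size > 0:
        i = int(scores.argmax())
        top = segments[i]
        keep.append((top, float(scores[i])))
        segments = np.delete(segments, i, axis=0)
        scores = np.delete(scores, i)
        if scores.size == 0:
            break
        # Temporal IoU between the top segment and the remaining ones.
        inter = np.clip(np.minimum(segments[:, 1], top[1])
                        - np.maximum(segments[:, 0], top[0]), 0.0, None)
        union = (segments[:, 1] - segments[:, 0]) + (top[1] - top[0]) - inter
        iou = inter / np.maximum(union, 1e-8)
        # Soft-NMS: decay overlapping scores instead of discarding them outright.
        scores = scores * np.exp(-(iou ** 2) / sigma)
        mask = scores > score_thresh
        segments, scores = segments[mask], scores[mask]
    return keep
```

Unlike hard NMS, overlapping candidates are down-weighted rather than deleted, which helps when nearby action instances genuinely overlap in time.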
“…This gap has recently been closed by EgoVLP [10], a dedicated egocentric pre-training method, following the release of the Ego4D dataset [7]. Meanwhile, our prior work, ActionFormer [12], a transformer-based backbone, recently established state-of-the-art results for temporal action localization. ActionFormer adopts local self-attention for temporal reasoning and captures actions of variable length using a flexible point-based action representation.…”
Section: Introduction
mentioning confidence: 99%
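
A minimal sketch of the local self-attention idea mentioned above: attention logits outside a fixed temporal window are masked out, so each time step attends only to its neighbors. This is a single-head version with no learned projections and a dense mask; the window size and function name are assumptions, and the actual ActionFormer implementation is more involved (multi-head, efficient windowed computation over a feature pyramid).

```python
import torch
import torch.nn.functional as F

def local_self_attention(x: torch.Tensor, window: int = 19) -> torch.Tensor:
    """Single-head local self-attention over a clip-feature sequence.

    x: (T, D) tensor of per-clip features; window: odd local window length.
    A dense-mask sketch; efficient versions compute only the banded scores.
    """
    T, D = x.shape
    scores = (x @ x.transpose(0, 1)) / (D ** 0.5)       # (T, T) logits
    pos = torch.arange(T)
    # Mask out pairs farther apart than half the window.
    outside = (pos[None, :] - pos[:, None]).abs() > window // 2
    scores = scores.masked_fill(outside, float("-inf"))
    attn = F.softmax(scores, dim=-1)                    # rows sum to 1
    return attn @ x                                     # (T, D) output
```

Stacked over a multi-scale feature pyramid, the same small window covers short actions at fine temporal resolutions and long actions at coarse ones.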
“…One-stage methods perform action localization and classification simultaneously. Constrained by the heavy computational cost of extracting features frame by frame, one-stage methods [12, 30] mainly use pre-extracted features as input. Recent work [9, 14, 29] also explores end-to-end training, which takes raw frames as input.…”
Section: Temporal Action Detection
mentioning confidence: 99%
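
To make the one-stage formulation concrete, here is a sketch of point-based decoding in the spirit of [12]: every time step of the (possibly pre-extracted) feature sequence simultaneously emits class scores and two non-negative offsets to the action's start and end. The class name, layer shapes, and exponential offset parameterization are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class PointHead(nn.Module):
    """Shared 1-D conv heads: per-time-step classification + boundary regression."""

    def __init__(self, dim: int = 512, num_classes: int = 20):
        super().__init__()
        self.cls_head = nn.Conv1d(dim, num_classes, kernel_size=3, padding=1)
        self.reg_head = nn.Conv1d(dim, 2, kernel_size=3, padding=1)

    def forward(self, feats: torch.Tensor, stride: float = 1.0):
        # feats: (B, dim, T) pre-extracted (or backbone) features
        cls_logits = self.cls_head(feats)                      # (B, C, T)
        offsets = self.reg_head(feats).exp()                   # (B, 2, T), > 0
        t = torch.arange(feats.shape[-1], dtype=feats.dtype,
                         device=feats.device) * stride         # time-step centers
        starts = t - offsets[:, 0] * stride                    # (B, T)
        ends = t + offsets[:, 1] * stride                      # (B, T)
        return cls_logits, torch.stack([starts, ends], dim=-1)  # (B, T, 2)
```

At inference, each (time step, class) pair with a high score yields a candidate segment, and a step like the Soft-NMS sketch above merges the overlapping candidates.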
“…In recent years, with the breakthrough progress of Transformers in computer vision [10]-[12], Transformers have shown superior performance in general image classification [10], image retrieval [13], and semantic segmentation [14]. The Vision Transformer (ViT) has proven its great potential in image classification by automatically identifying discriminative feature regions in images through its inherent attention mechanism.…”
Section: Introduction
mentioning confidence: 99%