2022 | Preprint
DOI: 10.48550/arxiv.2202.07925

ActionFormer: Localizing Moments of Actions with Transformers

Abstract: Self-attention based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding. Inspired by this success, we investigate the application of Transformer networks for temporal action localization in videos. To this end, we present ActionFormer, a simple yet powerful model to identify actions in time and recognize their categories in a single shot, without using action proposals or relying on pre-defined anchor windows. ActionFo…
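
The abstract describes a single-shot, anchor-free design: every moment in time is classified into an action category, and the action boundaries are regressed directly rather than taken from proposals or anchor windows. A minimal PyTorch sketch of that kind of prediction head; the class name, layer sizes, and head structure below are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    """Per-time-step classification plus boundary regression (illustrative sketch)."""
    def __init__(self, dim=512, num_classes=20):
        super().__init__()
        self.cls_head = nn.Conv1d(dim, num_classes, kernel_size=3, padding=1)
        self.reg_head = nn.Conv1d(dim, 2, kernel_size=3, padding=1)  # distances to start and end

    def forward(self, feats):                        # feats: (B, C, T) encoded video features
        cls_logits = self.cls_head(feats)            # (B, num_classes, T): action score per moment
        boundaries = torch.relu(self.reg_head(feats))  # (B, 2, T): non-negative offsets per moment
        return cls_logits, boundaries
```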

Cited by 5 publications (11 citation statements) | References 61 publications

“…The Transformer [39] originated in natural language processing (NLP) and has recently been widely explored in vision tasks. Specifically, Transformers have also shown great potential in video analysis, e.g., action recognition [28,48], video restoration [25], video question answering [10], video instance segmentation [44], etc. However, most spatiotemporal Transformers follow the de facto scheme of ViT [5], i.e., simply dividing an image into local patches and stacking global attention, which lacks sufficient exploration of the properties of the visual signal and thus suffers from insufficient token representation and explosive computation.…”
Section: Related Work (mentioning)
confidence: 99%
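
The "de facto scheme of ViT" this quote criticizes can be summarized in a few lines: the image is cut into fixed-size patches, each patch becomes a token, and full global self-attention runs over all tokens, with cost growing quadratically in the token count. A hedged PyTorch illustration; the patch size, embedding dimension, and head count are assumptions for the example, not values from any cited paper:

```python
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 384, kernel_size=16, stride=16)     # 16x16 patches -> one token each
attn = nn.MultiheadAttention(embed_dim=384, num_heads=6, batch_first=True)

img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img).flatten(2).transpose(1, 2)            # (1, 196, 384) patch tokens
out, _ = attn(tokens, tokens, tokens)                           # global attention: O(N^2) in token count
```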
“…TAD [5,11,30,50,67] is evaluated on interval-based metrics such as mAP @ temporal Intersection-over-Union (IoU) or at sub-sampled time points, neither of which enforces frame accuracy on the action boundaries. Down-sampling in time (up to 16×) is a common preprocessing step [3,38,39,48,66,70]. TAS [21,32,56] also optimizes interval-based metrics such as F1 @ temporal overlap.…”
Section: Related Work (mentioning)
confidence: 99%
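
For readers unfamiliar with the interval-based metric mentioned in this quote, temporal IoU measures the overlap between a predicted segment and a ground-truth segment along the time axis. A minimal helper, with segments given as (start, end) pairs in seconds (the function name is ours):

```python
def temporal_iou(pred, gt):
    """Intersection-over-Union of two 1D time intervals (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# e.g. temporal_iou((2.0, 7.0), (3.0, 8.0)) == 4.0 / 6.0 ≈ 0.67
```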
“…Recent approaches for TAD [10,38,39,59,66,69] and TAS [1,7,20,29,53,68] often proceed in two stages: (1) feature extraction and then (2) head learning for the end task. Fixed, pre-trained features from video classification on Kinetics-400 are often used for the first stage [2,6,63], and state-of-the-art TAD methods built on these features [41,70,73] often perform comparably to, if not better than, recent end-to-end learning approaches [36,40]. Indirect fine-tuning via classification in the target domain is sometimes performed to improve the feature encoding [2,48].…”
Section: Related Work (mentioning)
confidence: 99%
“…In order to model long-range context, some current works, such as RTD-Net [5] and TadTR [28], regard a video as a temporal sequence and introduce a self-attention Transformer structure. Because applying attention over the whole sequence is inefficient and introduces irrelevant noise, ActionFormer [4] proposed a local attention mechanism that limits the attention range to a fixed window. Considering that anchor-based and anchor-free methods have the advantages of stability and flexibility, respectively, A2Net [12] integrates the two into one framework so that these advantages complement each other.…”
Section: Related Work (mentioning)
confidence: 99%
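
The local attention the quote above attributes to ActionFormer restricts each time step to attend only to a fixed-size neighbourhood instead of the whole sequence. A small mask-based sketch of that idea in PyTorch; the window size and the mask formulation are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

T, dim, window = 128, 256, 19                 # sequence length, channels, assumed window size
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

idx = torch.arange(T)
# True = masked out: positions farther than window // 2 steps away are not attended to
mask = (idx[None, :] - idx[:, None]).abs() > window // 2        # (T, T) banded mask

x = torch.randn(1, T, dim)                    # one clip's temporal features
out, _ = attn(x, x, x, attn_mask=mask)        # attention restricted to the local band
```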
“…To model long-range temporal dependencies, the commonly used methods are stacked 1D temporal convolutions [1,2,3] and Transformers [4,5,6]. However, limited by the kernel size, the former can only capture local context: it can neither learn relationships between frames separated by long temporal intervals nor establish relationships between action instances.…”
Section: Introduction (mentioning)
confidence: 99%
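
To make the kernel-size limitation in this quote concrete: stacking L temporal 1D convolutions with kernel size k and stride 1 grows the receptive field only linearly, to 1 + L(k - 1) frames, so frames far apart in time never interact directly. A tiny illustration with assumed layer count and channel width:

```python
import torch.nn as nn

k, layers, dim = 3, 4, 256                    # assumed kernel size, depth, channels
tcn = nn.Sequential(
    *[nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2) for _ in range(layers)]
)
receptive_field = 1 + layers * (k - 1)        # = 9 frames here, regardless of sequence length
```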