2022 | Preprint
DOI: 10.48550/arxiv.2202.07925

ActionFormer: Localizing Moments of Actions with Transformers

Abstract: Self-attention based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding. Inspired by this success, we investigate the application of Transformer networks for temporal action localization in videos. To this end, we present ActionFormer, a simple yet powerful model to identify actions in time and recognize their categories in a single shot, without using action proposals or relying on pre-defined anchor windows. ActionFo…
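
The abstract describes a single-shot, anchor-free design: every moment in time is classified into an action category, and the action boundaries are regressed directly rather than taken from proposals or anchor windows. A minimal PyTorch sketch of that kind of prediction head; the class name, layer sizes, and head structure below are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    """Per-time-step classification plus boundary regression (illustrative sketch)."""
    def __init__(self, dim=512, num_classes=20):
        super().__init__()
        self.cls_head = nn.Conv1d(dim, num_classes, kernel_size=3, padding=1)
        self.reg_head = nn.Conv1d(dim, 2, kernel_size=3, padding=1)  # distances to start and end

    def forward(self, feats):                        # feats: (B, C, T) encoded video features
        cls_logits = self.cls_head(feats)            # (B, num_classes, T): action score per moment
        boundaries = torch.relu(self.reg_head(feats))  # (B, 2, T): non-negative offsets per moment
        return cls_logits, boundaries
```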

Cited by 5 publications (11 citation statements) | References 61 publications

“…The Transformer [39] originated in natural language processing (NLP) and has recently been widely explored in vision tasks. Specifically, Transformers have also shown great potential in video analysis, e.g., action recognition [28,48], video restoration [25], video question answering [10], video instance segmentation [44], etc. However, most spatiotemporal Transformers follow the de facto scheme of ViT [5], i.e., simply dividing an image into local patches and stacking global attention, which lacks sufficient exploration of the properties of the visual signal and thus suffers from insufficient token representation and explosive computation.…”
Section: Related Work (mentioning)
confidence: 99%
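
The "de facto scheme of ViT" this quote criticizes can be summarized in a few lines: the image is cut into fixed-size patches, each patch becomes a token, and full global self-attention runs over all tokens, with cost growing quadratically in the token count. A hedged PyTorch illustration; the patch size, embedding dimension, and head count are assumptions for the example, not values from any cited paper:

```python
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 384, kernel_size=16, stride=16)     # 16x16 patches -> one token each
attn = nn.MultiheadAttention(embed_dim=384, num_heads=6, batch_first=True)

img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img).flatten(2).transpose(1, 2)            # (1, 196, 384) patch tokens
out, _ = attn(tokens, tokens, tokens)                           # global attention: O(N^2) in token count
```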
“…TAD [5,11,30,50,67] is evaluated on interval-based metrics such as mAP @ temporal Intersection-over-Union (IoU) or at sub-sampled time points, neither of which enforces frame accuracy on the action boundaries. Down-sampling in time (up to 16×) is a common preprocessing step [3,38,39,48,66,70]. TAS [21,32,56] also optimizes interval-based metrics such as F1 @ temporal overlap.…”
Section: Related Work (mentioning)
confidence: 99%
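
For readers unfamiliar with the interval-based metric mentioned in this quote, temporal IoU measures the overlap between a predicted segment and a ground-truth segment along the time axis. A minimal helper, with segments given as (start, end) pairs in seconds (the function name is ours):

```python
def temporal_iou(pred, gt):
    """Intersection-over-Union of two 1D time intervals (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# e.g. temporal_iou((2.0, 7.0), (3.0, 8.0)) == 4.0 / 6.0 ≈ 0.67
```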
“…Recent approaches for TAD [10,38,39,59,66,69] and TAS [1,7,20,29,53,68] often proceed in two stages: (1) feature extraction and then (2) head learning for the end task. Fixed, pre-trained features from video classification on Kinetics-400 are often used for the first stage [2,6,63], and state-of-the-art TAD methods built on these features [41,70,73] often perform comparably to, if not better than, recent end-to-end learning approaches [36,40]. Indirect fine-tuning via classification in the target domain is sometimes performed to improve the feature encoding [2,48].…”
Section: Related Work (mentioning)
confidence: 99%
“…In order to model long-range context, some current works, such as RTD-Net [5] and TadTR [28], regard a video as a temporal sequence and introduce a self-attention Transformer structure. Because applying attention over the whole sequence is inefficient and introduces irrelevant noise, ActionFormer [4] proposed a local attention mechanism that limits the attention range to a fixed window. Considering that anchor-based and anchor-free methods have the advantages of stability and flexibility, respectively, A2Net [12] integrates the two into one framework so that these advantages complement each other.…”
Section: Related Work (mentioning)
confidence: 99%
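
The local attention the quote above attributes to ActionFormer restricts each time step to attend only to a fixed-size neighbourhood instead of the whole sequence. A small mask-based sketch of that idea in PyTorch; the window size and the mask formulation are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

T, dim, window = 128, 256, 19                 # sequence length, channels, assumed window size
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

idx = torch.arange(T)
# True = masked out: positions farther than window // 2 steps away are not attended to
mask = (idx[None, :] - idx[:, None]).abs() > window // 2        # (T, T) banded mask

x = torch.randn(1, T, dim)                    # one clip's temporal features
out, _ = attn(x, x, x, attn_mask=mask)        # attention restricted to the local band
```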
“…To model long-range temporal dependencies, the commonly used methods are stacked 1D temporal convolutions [1,2,3] and Transformers [4,5,6]. However, limited by the kernel size, the former can only capture local context: it can neither learn relationships between frames separated by long temporal intervals nor establish relationships between action instances.…”
Section: Introduction (mentioning)
confidence: 99%
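
To make the kernel-size limitation in this quote concrete: stacking L temporal 1D convolutions with kernel size k and stride 1 grows the receptive field only linearly, to 1 + L(k - 1) frames, so frames far apart in time never interact directly. A tiny illustration with assumed layer count and channel width:

```python
import torch.nn as nn

k, layers, dim = 3, 4, 256                    # assumed kernel size, depth, channels
tcn = nn.Sequential(
    *[nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2) for _ in range(layers)]
)
receptive_field = 1 + layers * (k - 1)        # = 9 frames here, regardless of sequence length
```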