“…Afterwards, Transformer layers are trained for a downstream task on those features. With this approach, many works [52], [67], [78], [81], [86], [95], [101], [136], [141], [171] are still able to train the Transformer on small datasets (<10k training samples). Nevertheless, medium to large datasets remain common, as in [53], [54], [56], [57], [58], [59], [66], [93], [172].…”
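
To make the described setup concrete, the following is a minimal sketch of training Transformer layers on pre-extracted features for a downstream classification task. It assumes PyTorch and cached feature sequences of shape [batch, seq_len, feat_dim] (e.g., produced by a frozen backbone); all class names, dimensions, and hyperparameters here are illustrative, not taken from any of the cited works.

```python
import torch
import torch.nn as nn

class TransformerHead(nn.Module):
    """Trainable Transformer layers applied on top of frozen, pre-extracted features."""
    def __init__(self, feat_dim=512, d_model=256, n_heads=4, n_layers=2, n_classes=10):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)           # map backbone features to model width
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.cls = nn.Linear(d_model, n_classes)            # downstream task head

    def forward(self, feats):                               # feats: [B, T, feat_dim]
        x = self.encoder(self.proj(feats))                  # only these layers are trained
        return self.cls(x.mean(dim=1))                      # pooled prediction

# One training step over a batch of cached features (dummy data for illustration).
model = TransformerHead()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
feats = torch.randn(32, 16, 512)                            # pre-extracted feature sequences
labels = torch.randint(0, 10, (32,))
opt.zero_grad()
loss = loss_fn(model(feats), labels)
loss.backward()
opt.step()
```

Because only the lightweight Transformer head is optimized while the feature extractor stays fixed, the number of trainable parameters is small, which is one reason such heads can be trained even on datasets with fewer than 10k samples.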