End-to-end Temporal Action Detection with Transformer

Liu, Xiaolong; Wang, Qimeng; Hu, Yao; Tang, Xu; Zhang, Shiwei; Bai, Song; Bai, Xiang

doi:10.48550/arxiv.2106.10271

Cited by 10 publications

(15 citation statements)

References 21 publications


“…[46,47] propose graph-based methods, where they define proposals and snippets as graph nodes and perform graph convolutions for the information exchange. Our approach is closer to recent work that leverage the Transformer architecture [30,33,38]. Due to the rising popularity of transformers for vision tasks [3,10,16], [30,33,38] extended the transformer building blocks to the inner working of TAL as a way to infuse temporal context between proposals.…”

Section: Related Work (mentioning)

confidence: 89%

“…Our approach is closer to recent work that leverage the Transformer architecture [30,33,38]. Due to the rising popularity of transformers for vision tasks [3,10,16], [30,33,38] extended the transformer building blocks to the inner working of TAL as a way to infuse temporal context between proposals. In contrast to prior art, our work considers the interplay of multiple modalities, visual and audio, while also modeling the surrounding context of an action.…”

Section: Related Work (mentioning)

confidence: 89%


OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via Audiovisual Temporal Context

Ramazanova, Escorcia, Heilbron, et al. 2022. Preprint.


“…and TadTR [152] use transformers to model long-range dependencies. Among them RTD-Net [66] achieved the There are also two state-of-the-art (SOTA) methods that do not belong to the mentioned categories of methods.…”

Section: Fully-supervised Methods (mentioning)

confidence: 99%

“…Transformer AGT [65], RTD-Net [66] ATAG [63], TadTR [152] + Modeling non-linear temporal structure and inter-proposal relationships for proposal generation. -High parametric complexity.…”

Section: RNNs (mentioning)

confidence: 99%

Deep Learning-based Action Detection in Untrimmed Videos: A Survey

Tian. 2021. Preprint.


Understanding human behavior and activity facilitates advancement of numerous real-world applications, and is critical for video analysis. Despite the progress of action recognition algorithms in trimmed videos, the majority of real-world videos are lengthy and untrimmed with sparse segments of interest. The task of temporal activity detection in untrimmed videos aims to localize the temporal boundary of actions and classify the action categories. Temporal activity detection task has been investigated in full and limited supervision settings depending on the availability of action annotations. This paper provides an extensive overview of deep learning-based algorithms to tackle temporal action detection in untrimmed videos with different supervision levels including fully-supervised, weakly-supervised, unsupervised, self-supervised, and semi-supervised. In addition, this paper also reviews advances in spatio-temporal action detection where actions are localized in both temporal and spatial dimensions. Moreover, the commonly used action detection benchmark datasets and evaluation metrics are described, and the performance of the state-of-the-art methods are compared. Finally, real-world applications of temporal action detection in untrimmed videos and a set of future directions are discussed.


“…Many other ViT variants [8,13,21,22,25,37,54,60,70] are proposed from then, which achieve promising performance compared with its counterpart CNNs for image analysis tasks [6,23,74]. Recently, some works introduce vision transformer for video understanding tasks such as action recognition [1,3,4,15,20,38,42], action detection [36,58,62,73], video superresolution [5], video inpainting [32,71], and 3D animation [9]. Some works [20,42] conduct temporal contextual modeling with transformer based on single-frame features from pretrained 2D networks, while other works [1,3,4,15,38] mine the spatio-temporal attentions via video transformer directly.…”

Section: Related Work (mentioning)

confidence: 99%

PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer

Yu, Shen, Shi, et al. 2021. Preprint.


Remote photoplethysmography (rPPG), which aims at measuring heart activities and physiological signals from facial video without any contact, has great potential in many applications (e.g., remote healthcare and affective computing). Recent deep learning approaches focus on mining subtle rPPG clues using convolutional neural networks with limited spatio-temporal receptive fields, which neglect the long-range spatio-temporal perception and interaction for rPPG modeling. In this paper, we propose the PhysFormer, an end-to-end video transformer based architecture, to adaptively aggregate both local and global spatio-temporal features for rPPG representation enhancement. As key modules in PhysFormer, the temporal difference transformers first enhance the quasi-periodic rPPG features with temporal difference guided global attention, and then refine the local spatio-temporal representation against interference. Furthermore, we also propose the label distribution learning and a curriculum learning inspired dynamic constraint in frequency domain, which provide elaborate supervisions for PhysFormer and alleviate overfitting. Comprehensive experiments are performed on four benchmark datasets to show our superior performance on both intra-and cross-dataset testings. One highlight is that, unlike most transformer networks needed pretraining from large-scale datasets, the proposed PhysFormer can be easily trained from scratch on rPPG datasets, which makes it promising as a novel transformer baseline for the rPPG community. The codes will be released at https://github.com/ZitongYu/PhysFormer.

