End-to-end Temporal Action Detection with Transformer
Xiaolong Liu, Qimeng Wang, Yao Hu et al.
Preprint, 2021. DOI: 10.48550/arxiv.2106.10271
Abstract: Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video. It is a fundamental task in video understanding, and significant progress has been made in TAD. Previous methods involve multiple stages or networks and hand-designed rules or operations, which fall short in efficiency and flexibility. Here, we construct an end-to-end framework for TAD upon Transformer, termed TadTR, which simultaneously predicts all action instances as a set of…
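The abstract frames TAD as direct set prediction: the model emits a set of (label, segment) candidates that are paired one-to-one with ground-truth instances by temporal overlap. A minimal sketch of the temporal IoU used in such pairing, with a simple greedy matcher standing in for the Hungarian matching DETR-style detectors typically use (the function names and the greedy strategy are illustrative assumptions, not TadTR's actual implementation):

```python
def temporal_iou(a, b):
    """IoU between two 1-D segments given as (start, end) pairs."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def greedy_match(preds, gts, iou_thresh=0.5):
    """Greedily pair each predicted segment with its best unused
    ground-truth segment; returns (pred_idx, gt_idx, iou) triples."""
    matches, used = [], set()
    for i, p in enumerate(preds):
        best_j, best_iou = -1, iou_thresh
        for j, g in enumerate(gts):
            if j in used:
                continue
            iou = temporal_iou(p, g)
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j >= 0:
            used.add(best_j)
            matches.append((i, best_j, best_iou))
    return matches
```

For example, `greedy_match([(0, 10), (20, 30)], [(1, 9), (21, 29)])` pairs each prediction with the ground truth it overlaps most.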
Cited by 10 publications (15 citation statements). References 21 publications.
“…[46,47] propose graph-based methods, where they define proposals and snippets as graph nodes and perform graph convolutions for the information exchange. Our approach is closer to recent work that leverage the Transformer architecture [30,33,38]. Due to the rising popularity of transformers for vision tasks [3,10,16], [30,33,38] extended the transformer building blocks to the inner working of TAL as a way to infuse temporal context between proposals.…”
Section: Related Work (mentioning)
Confidence: 89%
“…Our approach is closer to recent work that leverage the Transformer architecture [30,33,38]. Due to the rising popularity of transformers for vision tasks [3,10,16], [30,33,38] extended the transformer building blocks to the inner working of TAL as a way to infuse temporal context between proposals. In contrast to prior art, our work considers the interplay of multiple modalities, visual and audio, while also modeling the surrounding context of an action.…”
Section: Related Work (mentioning)
Confidence: 89%
“…and TadTR [152] use transformers to model long-range dependencies. Among them, RTD-Net [66] achieved … There are also two state-of-the-art (SOTA) methods that do not belong to the mentioned categories of methods.…”
Section: Fully-supervised Methods (mentioning)
Confidence: 99%
“…Transformer: AGT [65], RTD-Net [66], ATAG [63], TadTR [152]. + Modeling non-linear temporal structure and inter-proposal relationships for proposal generation. − High parametric complexity.…”
Section: RNNs (mentioning)
Confidence: 99%
“…Many other ViT variants [8,13,21,22,25,37,54,60,70] are proposed from then, which achieve promising performance compared with its counterpart CNNs for image analysis tasks [6,23,74]. Recently, some works introduce vision transformer for video understanding tasks such as action recognition [1,3,4,15,20,38,42], action detection [36,58,62,73], video superresolution [5], video inpainting [32,71], and 3D animation [9]. Some works [20,42] conduct temporal contextual modeling with transformer based on single-frame features from pretrained 2D networks, while other works [1,3,4,15,38] mine the spatio-temporal attentions via video transformer directly.…”
Section: Related Work (mentioning)
Confidence: 99%