2022
DOI: 10.1109/tip.2022.3195321
End-to-End Temporal Action Detection With Transformer

Cited by 138 publications (26 citation statements)
References 64 publications
“…As shown in Tab. 2, when using one modality as input, our model variants that only apply self-attention in the encoder outperform all compared TAL methods, where TadTR [25] and ActionFormer [48] also use an end-to-end transformer-based architecture. When using both audio and visual modalities, the performance of our model boosts significantly, e.g., +11.9% and +10.7% at the average mAP compared with our visual-only and audio-only variants, respectively.…”
Section: Results and Analysis (mentioning)
confidence: 82%
“…By contrast, single-stage TAL localizes actions in a single shot without using pre-generated proposals, including anchor-based [26] and anchor-free methods [19,47]. Besides, Transformers [41], with their powerful ability of long-range relation modeling, have recently also been considered in some single-stage TAL methods [25,36,48]. Sound event detection (SED) focuses on recognizing and locating audio events in pure acoustic environments [27].…”
Section: Uni-modal Temporal Localization Tasks (mentioning)
confidence: 99%
“…Vid2Seq achieves state-of-the-art results on various dense event captioning datasets, as well as multiple video paragraph captioning and standard video clip captioning benchmarks. Finally, we believe the sequence-to-sequence design of Vid2Seq has the potential to be extended to a wide range of other video tasks such as temporally-grounded video question answering [51,56,57] or temporal action localization [16,67,123].…”
Section: Discussion (mentioning)
confidence: 99%
“…However, the predicted proposal relies heavily on local information and does not make full use of context relations. In order to model long-range context, some current works, such as the RTD-Net [5] and TadTR [28], regard video as a temporal sequence and introduce a self-attention transformer structure. Because using the attention mechanism over the whole sequence is inefficient and introduces irrelevant noise interference, ActionFormer [4] proposed a local attention mechanism that limits the attention range within a fixed window.…”
Section: Related Work (mentioning)
confidence: 99%
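
The excerpt above contrasts global self-attention over the whole temporal sequence with a local attention whose range is limited to a fixed window. Below is a minimal sketch of that windowed-attention idea over a 1D video feature sequence; the names LocalSelfAttention and window_size are illustrative assumptions and do not come from the code of TadTR, RTD-Net, or ActionFormer.

```python
# Sketch of windowed (local) self-attention over per-snippet video features:
# each temporal position may only attend to positions within a fixed window
# around it, instead of to the whole sequence.
import torch
import torch.nn as nn


class LocalSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, window_size: int = 9):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.window_size = window_size  # attention limited to +/- window_size // 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, dim) sequence of per-snippet video features
        T = x.size(1)
        idx = torch.arange(T, device=x.device)
        # mask[i, j] = True means position i is NOT allowed to attend to j
        mask = (idx[None, :] - idx[:, None]).abs() > self.window_size // 2
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out


if __name__ == "__main__":
    feats = torch.randn(2, 128, 256)      # 2 clips, 128 snippets, 256-d features
    local_attn = LocalSelfAttention(dim=256)
    print(local_attn(feats).shape)        # torch.Size([2, 128, 256])
```

With the mask removed, the same module reduces to the global self-attention used by the fully transformer-based detectors mentioned above; the fixed window simply bounds each query's receptive field, which is the efficiency/noise trade-off the excerpt attributes to ActionFormer.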