2021
DOI: 10.48550/arxiv.2102.01894
Preprint

Relaxed Transformer Decoders for Direct Action Proposal Generation

Abstract: Temporal action proposal generation is an important and challenging task in video understanding, which aims at detecting all temporal segments containing action instances of interest. The existing proposal generation approaches are generally based on pre-defined anchor windows or heuristic bottom-up boundary matching strategies. This paper presents a simple and end-to-end learnable framework (RTD-Net) for direct action proposal generation, by re-purposing a Transformer-alike architecture. To tackle the essenti…
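As a rough illustration of the "direct proposal generation with a Transformer-style decoder" idea described in the abstract, the sketch below shows a DETR-style decoder that maps a set of learnable proposal queries over snippet-level video features to (center, width) segments and confidence scores. This is a minimal sketch, not the authors' RTD-Net implementation: the class name DirectProposalDecoder, the feature dimension, the number of queries, and the two prediction heads are illustrative assumptions.

```python
# Hypothetical sketch of a DETR-style decoder for direct temporal proposal
# generation; all names and hyperparameters are illustrative, not RTD-Net's.
import torch
import torch.nn as nn


class DirectProposalDecoder(nn.Module):
    def __init__(self, feat_dim=512, num_queries=32, num_layers=3, num_heads=8):
        super().__init__()
        # Learnable proposal queries, one per candidate temporal segment.
        self.queries = nn.Embedding(num_queries, feat_dim)
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Prediction heads: normalized (center, width) plus a confidence score.
        self.segment_head = nn.Linear(feat_dim, 2)
        self.score_head = nn.Linear(feat_dim, 1)

    def forward(self, clip_features):
        # clip_features: (batch, num_snippets, feat_dim) snippet-level features.
        batch = clip_features.size(0)
        q = self.queries.weight.unsqueeze(0).expand(batch, -1, -1)
        decoded = self.decoder(q, clip_features)         # (batch, num_queries, feat_dim)
        segments = self.segment_head(decoded).sigmoid()  # (center, width) in [0, 1]
        scores = self.score_head(decoded).sigmoid()      # proposal confidence
        return segments, scores


# Usage: one video represented by 100 snippets of 512-d features.
feats = torch.randn(1, 100, 512)
segments, scores = DirectProposalDecoder()(feats)
print(segments.shape, scores.shape)  # torch.Size([1, 32, 2]) torch.Size([1, 32, 1])
```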

Cited by 9 publications (12 citation statements)
References 24 publications (57 reference statements)
“…Among them VSGN [64] achieved the best performance by exploiting correlations between cross-scale snippets (original and magnified) and aggregating their features with a graph pyramid network. AGT [65], RTD-Net [66], ATAG [63],…”
Section: Fully-supervised Methods (mentioning)
Confidence: 99%
“…and TadTR [152] use transformers to model long-range dependencies. Among them RTD-Net [66] achieved the […] There are also two state-of-the-art (SOTA) methods that do not belong to the mentioned categories of methods. TSP [154] proposed a novel supervised pretraining paradigm for clip features, and improved the performance of SOTA using features trained with the proposed pretraining strategy.…”
Section: Fully-supervised Methods (mentioning)
Confidence: 99%
“…Thus, transformers are especially good at modelling long-range dependencies between elements of a sequence. Since then, there have been several attempts to adapt transformers towards vision tasks including object detection [2,56], image classification [8,41,52,46,14], segmentation [44], multiple object tracking [37,29], human pose estimation [50,55], point cloud processing [12,54], video processing [10,31,38], image super-resolution [30,49,3], image synthesis [9], etc. An extensive review is out of the scope of this paper.…”
Section: Transformers and Vision Transformers (mentioning)
Confidence: 99%