Relaxed Transformer Decoders for Direct Action Proposal Generation

Tan, Jing; Tang, Jiaqi; Wang, Limin; Wu, Guorong

doi:10.48550/arxiv.2102.01894

Cited by 9 publications

(12 citation statements)

References 24 publications

(57 reference statements)

“…Among them VSGN [64] achieved the best performance by exploiting correlations between cross-scale snippets (original and magnified) and aggregating their features with a graph pyramid network. AGT [65], RTD-Net [66], ATAG [63],…”

Section: Fully-supervised Methods

mentioning

confidence: 99%

“…and TadTR [152] use transformers to model long-range dependencies. Among them RTD-Net [66] achieved the There are also two state-of-the-art (SOTA) methods that do not belong to the mentioned categories of methods. TSP [154] proposed a novel supervised pretraining paradigm for clip features, and improved the performance of SOTA using features trained with the proposed pretraining strategy.…”

Section: Fully-supervised Methods

mentioning

confidence: 99%

“…Transformer: AGT [65], RTD-Net [66], ATAG [63], TadTR [152]. (+) Modeling non-linear temporal structure and inter-proposal relationships for proposal generation. (-) High parametric complexity.…”

Section: RNNs

mentioning

confidence: 99%

“…the input video) and graph structured query embeddings (latent representations of the action queries). Tan et al. in RTD-Net [66] proposed a relaxed transformer to directly generate action proposals without the need for human prior knowledge in the careful design of anchor placement or boundary matching mechanisms. The transformer encoder models long-range temporal context and captures inter-proposal relationships from a global view to precisely localize action instances.…”

mentioning

confidence: 99%

Deep Learning-based Action Detection in Untrimmed Videos: A Survey

Tian

2021

Preprint

Understanding human behavior and activity facilitates advancement of numerous real-world applications, and is critical for video analysis. Despite the progress of action recognition algorithms in trimmed videos, the majority of real-world videos are lengthy and untrimmed with sparse segments of interest. The task of temporal activity detection in untrimmed videos aims to localize the temporal boundaries of actions and classify the action categories. The temporal activity detection task has been investigated in full and limited supervision settings depending on the availability of action annotations. This paper provides an extensive overview of deep learning-based algorithms to tackle temporal action detection in untrimmed videos with different supervision levels including fully-supervised, weakly-supervised, unsupervised, self-supervised, and semi-supervised. In addition, this paper also reviews advances in spatio-temporal action detection where actions are localized in both temporal and spatial dimensions. Moreover, the commonly used action detection benchmark datasets and evaluation metrics are described, and the performance of the state-of-the-art methods is compared. Finally, real-world applications of temporal action detection in untrimmed videos and a set of future directions are discussed.

“…Thus, transformers are especially good at modelling long-range dependencies between elements of a sequence. Since then, there have been several attempts to adapt transformers towards vision tasks including object detection [2,56], image classification [8,41,52,46,14], segmentation [44], multiple object tracking [37,29], human pose estimation [50,55], point cloud processing [12,54], video processing [10,31,38], image super-resolution [30,49,3], image synthesis [9], etc. An extensive review is out of the scope of this paper.…”

Section: Transformers and Vision Transformers

mentioning

confidence: 99%

LocalViT: Bringing Locality to Vision Transformers

Li, Zhang, Cao et al.

2021

Preprint

We study how to introduce locality mechanisms into vision transformers. The transformer network originates from machine translation and is particularly good at modelling long-range dependencies within a long sequence. Although the global interaction between the token embeddings could be well modelled by the self-attention mechanism of transformers, what is lacking is a locality mechanism for information exchange within a local region. Yet, locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects. We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network. This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks. The importance of locality mechanisms is validated in two ways: 1) A wide range of design choices (activation function, layer placement, expansion ratio) are available for incorporating locality mechanisms and all proper choices can lead to a performance gain over the baseline, and 2) The same locality mechanism is successfully applied to 4 vision transformers, which shows the generalization of the locality concept. In particular, for ImageNet2012 classification, the locality-enhanced transformers outperform the baselines DeiT-T [41] and PVT-T [46] by 2.6% and 3.1% with a negligible increase in the number of parameters and computational effort. Code is available at https://github.com/ofsoundof/LocalViT.

Temporal Action Localization With Coarse-to-Fine Network

Zhang

2022

IEEE Access

Precisely localizing temporal intervals for each action segment in long raw videos is an essential challenge in practical video content analysis (e.g., activity detection or video caption generation). Most previous works neglect the hierarchical action granularity and eventually fail to identify precise action boundaries (e.g., embracing approaching or turning a screw in mechanical maintenance). In this paper, we introduce a simple yet efficient coarse-to-fine network (CFNet) to solve the challenging issue of temporal action localization by progressively refining action boundaries at multiple action granularities. The proposed CFNet is mainly composed of three components: a coarse proposal module (CPM) to generate coarse action candidates, a fusion block (FB) to enhance feature representation by fusing the coarse candidate features and corresponding features of raw input frames, and a boundary transformer module (BTM) to further refine action boundaries. Specifically, CPM exploits framewise, matching and gated actionness curves to complement each other for coarse candidate generation at different levels, while FB is devised to enrich feature representation by fusing the last feature map of CPM and corresponding raw frame input. Finally, BTM learns long-term temporal dependency with a transformer structure to further refine action boundaries at a finer granularity. Thus, the fine-grained action intervals can be incrementally obtained. Compared with previous state-of-the-art techniques, the proposed coarse-to-fine network can asymptotically approach fine-grained action boundaries. Comprehensive experiments are conducted on both the publicly available THUMOS14 and ActivityNet-v1.3 datasets, and show the outstanding improvements of our method when compared with prior methods on various video action parsing tasks.
