2019 IEEE Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv.2019.00022

TAN: Temporal Aggregation Network for Dense Multi-Label Action Recognition

Abstract: We present Temporal Aggregation Network (TAN) which decomposes 3D convolutions into spatial and temporal aggregation blocks. By stacking spatial and temporal convolutions repeatedly, TAN forms a deep hierarchical representation for capturing spatio-temporal information in videos. Since we do not apply 3D convolutions in each layer but only apply temporal aggregation blocks once after each spatial downsampling layer in the network, we significantly reduce the model complexity. The use of dilated convolutions at…
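
To make the decomposition concrete, here is a minimal PyTorch sketch of the idea the abstract describes: spatial-only convolutions carry out each downsampling stage, and a temporal aggregation block built from dilated, temporal-only convolutions is applied once after the stage. Class names, channel counts, dilation rates, and the branch summation are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class TemporalAggregationBlock(nn.Module):
    """Illustrative temporal aggregation block: parallel temporal-only
    convolutions with different dilation rates whose outputs are summed.
    The rates and the summation are assumptions, not the paper's design."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            # Kernel (3, 1, 1) convolves over time only; dilation d widens the
            # temporal receptive field while padding d keeps T unchanged.
            nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                      padding=(d, 0, 0), dilation=(d, 1, 1))
            for d in dilations
        ])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                        # x: (N, C, T, H, W)
        return self.relu(sum(branch(x) for branch in self.branches))


class TinyTANStage(nn.Module):
    """One stage: a spatial-only downsampling convolution followed by a single
    temporal aggregation block, instead of a full 3D convolution per layer."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3),
                                 stride=(1, 2, 2), padding=(0, 1, 1))
        self.temporal = TemporalAggregationBlock(out_ch)

    def forward(self, x):                        # x: (N, C, T, H, W)
        return self.temporal(torch.relu(self.spatial(x)))


x = torch.randn(1, 3, 8, 64, 64)                 # a tiny 8-frame clip
print(TinyTANStage(3, 16)(x).shape)              # torch.Size([1, 16, 8, 32, 32])
```

Stacking such stages deepens the spatial hierarchy while the temporal resolution is left intact, which is one plausible reading of how the model complexity is reduced relative to per-layer 3D convolutions.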

Cited by 22 publications (16 citation statements)
References 45 publications

“…With recent advances in deep learning, there has been fruitful progress in video analysis. While the performance of activity recognition has improved a lot [26,27,22,6,18,32], the detection performance still remains unsatisfactory [20,9,13,7,34].…”
Section: Introduction
confidence: 99%
“…Recently, Dai et al. proposed to decompose 3D convolutions into aggregation blocks to better exploit the spatial-temporal nature of video. We adopt the TAN [9] model to obtain a visual representation from video. As illustrated in Figure 2, an input video $V = \{v_t\}_{t=1}^{T_v}$ is encoded into a clip-level feature $f_v \in \mathbb{R}^{T_f \times d}$, where $T_f$ is the total number of clips and $d$ is the feature dimension.…”
Section: Single-shot Video Encoder
confidence: 99%
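
As a hedged illustration of that clip-level encoding, the sketch below splits a frame tensor into non-overlapping clips and maps each clip to a d-dimensional vector, producing $f_v$ of shape $(T_f, d)$. The function name, the placeholder pooling used when no encoder is supplied, and the default sizes are assumptions; `encoder` merely stands in for TAN.

```python
import torch
import torch.nn as nn

def encode_clips(frames, clip_len=8, d=512, encoder=None):
    """Hypothetical sketch: split frames (T_v, C, H, W) into non-overlapping
    clips of `clip_len` frames and map each clip to a d-dim vector,
    giving f_v with shape (T_f, d). `encoder` stands in for TAN."""
    T_v, C = frames.shape[0], frames.shape[1]
    T_f = T_v // clip_len                                # number of whole clips
    clips = frames[:T_f * clip_len].reshape(T_f, clip_len, *frames.shape[1:])
    if encoder is not None:
        return torch.stack([encoder(c) for c in clips])  # (T_f, d)
    # Placeholder encoder: average-pool each clip over time and space,
    # then linearly project the channel vector to d dimensions.
    pooled = clips.mean(dim=(1, 3, 4))                   # (T_f, C)
    return nn.Linear(C, d)(pooled)                       # (T_f, d)

f_v = encode_clips(torch.randn(64, 3, 224, 224))         # 64 frames -> (8, 512)
print(f_v.shape)
```
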
“…A single-layer LSTM with d = 512 hidden units is applied to obtain the sentence representation. For the video encoder, TAN [9] is used for feature extraction. The model takes as input a clip of 8 RGB frames with spatial size 256 × 256 and extracts…”
Section: Implementation Details
confidence: 99%
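
A small sketch of those implementation details follows; only the 512-unit single-layer LSTM, the 8-frame clip length, and the 256 × 256 spatial size come from the excerpt, while the vocabulary size, embedding dimension, and sentence length are made-up placeholders.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, d = 10000, 300, 512      # vocab/embedding sizes are assumptions

embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_size=d, num_layers=1, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 12))  # a 12-token sentence
_, (h_n, _) = lstm(embed(tokens))
sentence_repr = h_n[-1]                         # (1, 512) sentence representation

# Video side: the encoder consumes clips of 8 RGB frames at 256 x 256.
clip = torch.randn(1, 3, 8, 256, 256)           # (N, C, T, H, W)
```
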
“…On the other hand, Dai et al. [17] and Yang et al. [18] used dilated convolutions to preserve temporal receptive fields as they reduced downsampling along the temporal dimension. Dai et al. [17] used 1D convolution blocks with multi-stride temporal dilation to handle multi-scale temporal features on top of a 2D CNN model. Yang et al. [18] introduced a temporal dilated convolution, called temporal preservation convolution (TPC), to keep the time dimension by removing downsampling while preserving the receptive field that strided convolutions would otherwise provide.…”
Section: Fine-grained Temporal Features
confidence: 99%
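
The shared trick in the works quoted above can be shown in a few lines: drop the temporal stride and instead dilate a 1D temporal convolution, so the time dimension is preserved while the temporal receptive field still grows. The channel count and dilation rate below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Dilated temporal convolution: kernel 3 with dilation 2 spans 5 time steps,
# and padding 2 keeps the output length equal to the input length T.
temporal_conv = nn.Conv1d(in_channels=256, out_channels=256,
                          kernel_size=3, dilation=2, padding=2)

features = torch.randn(1, 256, 64)    # (N, channels, T) per-frame features
out = temporal_conv(features)
print(out.shape)                      # torch.Size([1, 256, 64]) -- T preserved
```
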