Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3413860
Deep Concept-wise Temporal Convolutional Networks for Action Localization

Abstract: Existing action localization approaches adopt shallow temporal convolutional networks (i.e., TCNs) on a 1D feature map extracted from video frames. In this paper, we empirically find that stacking more conventional temporal convolution layers actually deteriorates action classification performance, possibly because all channels of the 1D feature map, which are generally highly abstract and can be regarded as latent concepts, are excessively recombined in temporal convolution. To address this issue, we intro…
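The abstract's core idea is to keep the channels of the 1D feature map (the latent "concepts") separate during temporal convolution instead of recombining them. A minimal PyTorch-style sketch of that idea, assuming concept-wise temporal convolution can be approximated by a depthwise (per-channel) 1D convolution; the class and parameter names are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ConceptWiseTemporalConv(nn.Module):
    """Sketch of a concept-wise temporal convolution layer.

    Each channel of the 1D feature map (a latent "concept") is convolved
    along time independently (groups=channels), so concepts are not
    recombined the way a standard temporal convolution would mix them.
    """

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(
            in_channels=channels,
            out_channels=channels,
            kernel_size=kernel_size,
            padding=kernel_size // 2,
            groups=channels,  # one filter per concept, no cross-channel mixing
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) feature map extracted from video frames
        return self.relu(self.conv(x))


# Toy usage: a 2048-dim feature sequence over 100 snippets
feats = torch.randn(2, 2048, 100)
layer = ConceptWiseTemporalConv(2048)
print(layer(feats).shape)  # torch.Size([2, 2048, 100])
```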

Cited by 28 publications (8 citation statements)
References 96 publications (186 reference statements)
“…Our method achieves an average mAP of 62.7% ([0.3 : 0.1 : 0.7]), with an mAP of 65.6% at tIoU=0.5 and an mAP of 42.6% at tIoU=0.7, outperforming all previous methods by a large margin (+8.7% mAP at tIoU=0.5 and +11.6% mAP at tIoU=0.7). Our results stay on top of all single-stage methods and also beat all previous two-stage methods, including the latest ones from [33,51,58,84]. Note that our method significantly outperforms the concurrent work of TadTR [46], which also designed a Transformer model for TAL.…”
Section: Results (supporting)
confidence: 58%
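The "[0.3 : 0.1 : 0.7]" notation in the statement above denotes averaging mAP over tIoU thresholds from 0.3 to 0.7 in steps of 0.1. A tiny illustration of that averaging; the per-threshold numbers below are hypothetical placeholders except the tIoU=0.5 (65.6%) and tIoU=0.7 (42.6%) values quoted above:

```python
def average_map(map_per_tiou: dict) -> float:
    """Average mAP over a set of tIoU thresholds, e.g. {0.3, 0.4, 0.5, 0.6, 0.7}."""
    return sum(map_per_tiou.values()) / len(map_per_tiou)

# Hypothetical per-threshold mAPs for illustration only; just the 0.5 and 0.7
# entries come from the citation statement above.
print(average_map({0.3: 78.0, 0.4: 72.0, 0.5: 65.6, 0.6: 55.0, 0.7: 42.6}))
```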
“…To handle temporal-level variations, for example, C-TCN [18] proposes Random Move to simulate the montage effect of video production. However, image-level variations have long been neglected by the TAD community, so most previous works do not realize the effectiveness of image-level data augmentation (ILDA) for TAD model training.…”
Section: Image Level Data Augmentation (mentioning)
confidence: 99%
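The statement above contrasts temporal-level augmentation (C-TCN's Random Move) with image-level augmentation. A rough sketch of a temporal-level augmentation in the spirit of Random Move; the exact C-TCN procedure is not described here, so this relocation scheme and the function name are assumptions:

```python
import random

def random_move(features: list, segment: tuple) -> tuple:
    """Hedged sketch of a Random-Move-style temporal augmentation.

    `features` is a list of per-snippet feature vectors and `segment` the
    (start, end) indices of a ground-truth action. The action snippets are
    cut out and re-inserted at a random position, loosely simulating the
    montage effect of video production. The actual C-TCN scheme may differ.
    """
    start, end = segment
    action = features[start:end]
    background = features[:start] + features[end:]
    new_start = random.randint(0, len(background))
    moved = background[:new_start] + action + background[new_start:]
    new_segment = (new_start, new_start + len(action))
    return moved, new_segment
```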
“…[40] found that fusing RGB and Optical Flow streams at the last convolutional layer yields good visual modality features. The resulting mid-level features have been successfully employed by well-performing TAL approaches [41,42,43,44]. In particular, they are utilized by G-TAD [2] to obtain feature representations for each temporal proposal.…”
Section: Related Work (mentioning)
confidence: 99%
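A minimal sketch of the kind of two-stream late fusion described above: RGB and optical-flow feature maps from the last convolutional stage are combined into the mid-level features consumed by TAL methods. Channel-wise concatenation is assumed here; averaging is another common choice, and [40] may use a different scheme:

```python
import torch

def fuse_two_stream(rgb_feats: torch.Tensor, flow_feats: torch.Tensor) -> torch.Tensor:
    """Concatenate RGB and optical-flow features along the channel axis.

    Both tensors are assumed to be (batch, channels, time) snippet-level
    feature maps taken from the last convolutional layer of each stream.
    """
    return torch.cat([rgb_feats, flow_feats], dim=1)

# Toy usage: 1024-dim RGB + 1024-dim flow features over 100 snippets
rgb = torch.randn(2, 1024, 100)
flow = torch.randn(2, 1024, 100)
fused = fuse_two_stream(rgb, flow)  # shape: (2, 2048, 100)
```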