Abstract: Self-supervised learning has shown remarkable performance in exploiting unlabeled data for various video tasks. In this paper, we focus on applying the power of self-supervised methods to improve semi-supervised action proposal generation. In particular, we design an effective Self-supervised Semi-supervised Temporal Action Proposal (SSTAP) framework. SSTAP contains two crucial branches, i.e., a temporal-aware semi-supervised branch and a relation-aware self-supervised branch. The semi-supervised branch improves the…
“…Channel-Separated Convolutional Network (CSN) [13] aims to reduce the parameters of 3D convolution and, at the same time, extract useful information by identifying important channels. It can efficiently learn feature representations.…”
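As a rough illustration of the channel-separation idea in the quote above, the sketch below factorizes a standard 3D convolution into a depthwise (per-channel) 3D convolution followed by a pointwise 1x1x1 convolution. The module name and shapes are illustrative assumptions, not the exact CSN block design from [13].

```python
import torch
import torch.nn as nn

class ChannelSeparatedConv3d(nn.Module):
    """Illustrative channel-separated 3D convolution (not the exact CSN block).

    A standard k x k x k 3D conv is factorized into:
      1) a depthwise 3D conv (groups == channels) that mixes information
         only within each channel, and
      2) a pointwise 1x1x1 conv that mixes information across channels.
    This reduces parameters from C_in*C_out*k^3 to roughly C_in*k^3 + C_in*C_out.
    """

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv3d(
            in_channels, in_channels, kernel_size,
            padding=kernel_size // 2, groups=in_channels, bias=False)
        self.pointwise = nn.Conv3d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        return self.pointwise(self.depthwise(x))


if __name__ == "__main__":
    clip = torch.randn(2, 64, 8, 56, 56)   # a toy batch of video clips
    block = ChannelSeparatedConv3d(64, 128)
    print(block(clip).shape)               # torch.Size([2, 128, 8, 56, 56])
```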
“…The temporal shift operation was first applied to action recognition in TSM [9], and was then used as a kind of perturbation in SSTAP [18] for semi-supervised learning. Here we reuse this perturbation as a feature augmentation.…”
Section: Data Augmentation Module
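The temporal shift perturbation quoted above can be sketched as follows: a fraction of the feature channels is shifted one step backward or forward along the temporal axis, while the remaining channels stay in place. The shift ratio, zero-padding, and function name are illustrative assumptions, not the exact TSM/SSTAP implementation.

```python
import torch

def temporal_shift(features: torch.Tensor, shift_ratio: float = 0.25) -> torch.Tensor:
    """Shift a fraction of channels along the temporal axis (illustrative sketch).

    features: (batch, time, channels) sequence of pre-extracted clip features.
    One quarter of the channels is shifted backward in time, one quarter forward,
    and the rest is left untouched; vacated positions are zero-padded.
    """
    b, t, c = features.shape
    fold = int(c * shift_ratio)
    out = torch.zeros_like(features)
    out[:, :-1, :fold] = features[:, 1:, :fold]                   # shift backward in time
    out[:, 1:, fold:2 * fold] = features[:, :-1, fold:2 * fold]   # shift forward in time
    out[:, :, 2 * fold:] = features[:, :, 2 * fold:]              # keep remaining channels
    return out


if __name__ == "__main__":
    feats = torch.randn(4, 100, 256)   # 100 temporal positions, 256-d features
    perturbed = temporal_shift(feats)
    print(perturbed.shape)             # torch.Size([4, 100, 256])
```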
“…Action understanding is an important area in computer vision, and it draws growing attention from both industry and academia because of its use in human-computer interaction, public security, and other far-reaching applications. It includes many sub-research directions, such as Action Recognition [15,6,7], Temporal Action Detection [10,11,18], Spatio-Temporal Action Detection [12,8], etc. In this report, we introduce our method for the temporal action detection task in the 6th ActivityNet challenge [2].…”
Section: Introduction
“…Figure 2. Detailed diagram of the data augmentation module. Note that this module is borrowed from our SSTAP [18].…”
This technical report presents our solution for the temporal action detection task in ActivityNet Challenge 2021. The purpose of this task is to locate and identify actions of interest in long untrimmed videos. The crucial challenge of the task is that the temporal duration of actions varies dramatically, and the target actions are typically embedded in a background of irrelevant activities. Our solution builds on BMN [10] and mainly contains three steps: 1) action classification and feature encoding with SlowFast [6], CSN [13], and ViViT [1]; 2) proposal generation, where we improve BMN by embedding the proposed Proposal Relation Network (PRN), which allows us to generate high-quality proposals; 3) action detection, where we obtain detection results by assigning the proposals their corresponding classification results. Finally, we ensemble the results under different settings and achieve 44.7% average mAP on the test set, which improves the champion result of ActivityNet 2020 [17] by 1.9%.
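The third step above, assigning classification results to class-agnostic proposals, can be sketched as a simple score fusion: each proposal keeps its boundaries and confidence, and inherits the top video-level class labels and scores. The fusion rule (multiplying proposal and class scores) and the data layout below are illustrative assumptions, not the exact pipeline of the report.

```python
from typing import List, Dict, Tuple

def assign_classes_to_proposals(
    proposals: List[Dict],                    # [{"start": s, "end": e, "score": p}, ...]
    video_classes: List[Tuple[str, float]],   # [(label, prob), ...] from the classifier
    top_k: int = 2,
) -> List[Dict]:
    """Turn class-agnostic proposals into detections (illustrative sketch).

    Each proposal is duplicated for the top-k video-level classes, and its
    detection score is taken as proposal_confidence * class_probability.
    """
    top_classes = sorted(video_classes, key=lambda kv: kv[1], reverse=True)[:top_k]
    detections = []
    for prop in proposals:
        for label, prob in top_classes:
            detections.append({
                "start": prop["start"],
                "end": prop["end"],
                "label": label,
                "score": prop["score"] * prob,
            })
    return sorted(detections, key=lambda d: d["score"], reverse=True)


if __name__ == "__main__":
    props = [{"start": 12.0, "end": 34.5, "score": 0.91},
             {"start": 40.0, "end": 58.0, "score": 0.63}]
    cls = [("long jump", 0.82), ("triple jump", 0.11), ("running", 0.04)]
    for det in assign_classes_to_proposals(props, cls):
        print(det)
```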
“…Following mainstream action proposal generation methods [16,17,15,25,10,5,2,26,19,20,23,24], we pre-extract features for each video. Specifically, a video containing l frames is divided uniformly into N clips.…”
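The uniform clip division mentioned above can be sketched as follows: the l frames are split into N equal-length windows, and one feature vector is extracted per window. The encoder interface below is a placeholder assumption; in practice it would be a pretrained backbone such as SlowFast or CSN.

```python
import numpy as np

def split_into_clips(num_frames: int, num_clips: int):
    """Return (start, end) frame indices of num_clips uniform windows (sketch)."""
    # np.linspace gives num_clips + 1 evenly spaced boundaries over [0, num_frames]
    bounds = np.linspace(0, num_frames, num_clips + 1, dtype=int)
    return list(zip(bounds[:-1], bounds[1:]))

def extract_video_features(frames: np.ndarray, num_clips: int, encoder) -> np.ndarray:
    """Encode each uniform clip into one feature vector; `encoder` is a placeholder."""
    clips = split_into_clips(len(frames), num_clips)
    return np.stack([encoder(frames[s:e]) for s, e in clips])  # (num_clips, feat_dim)


if __name__ == "__main__":
    video = np.random.rand(960, 224, 224, 3)                 # a toy "video" of 960 frames
    dummy_encoder = lambda clip: clip.mean(axis=(0, 1, 2))   # stand-in for a real backbone
    feats = extract_video_features(video, num_clips=100, encoder=dummy_encoder)
    print(feats.shape)                                       # (100, 3)
```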
Temporal action localization aims to localize the starting and ending times of actions along with their categories. Limited by GPU memory, mainstream methods pre-extract features for each video, so feature quality determines the upper bound of detection performance. In this technical report, we explored both classic convolution-based backbones and recent transformer-based backbones. We found that the transformer-based methods can achieve better classification performance than the convolution-based ones, but they cannot generate accurate action proposals. In addition, extracting features at a larger frame resolution, which reduces the loss of spatial information, can also effectively improve temporal action localization performance. With a single SlowFast [9] feature and a simple combination of BMN [16] and TCANet [19], we achieve 42.42% in terms of mAP on the validation set, which is 1.87% higher than the multi-model ensemble result of 2020 [20]. Finally, we rank 1st in the CVPR 2021 HACS Supervised Temporal Action Localization Challenge.
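Results such as the mAP numbers above are obtained by matching predicted segments to ground-truth segments under temporal IoU (tIoU) thresholds and averaging precision over thresholds. The sketch below shows the tIoU computation and a greedy matching step at one threshold; it is a simplified illustration of the idea, not the official HACS/ActivityNet evaluation code.

```python
from typing import List, Tuple

def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Intersection-over-union of two temporal segments (start, end) in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_detections(
    detections: List[Tuple[float, float, float]],   # (start, end, score), one class
    ground_truth: List[Tuple[float, float]],
    tiou_threshold: float,
) -> int:
    """Greedy matching: count true positives at one tIoU threshold (sketch)."""
    matched = set()
    true_positives = 0
    for start, end, _ in sorted(detections, key=lambda d: d[2], reverse=True):
        best_iou, best_gt = 0.0, None
        for i, gt in enumerate(ground_truth):
            if i in matched:
                continue
            iou = temporal_iou((start, end), gt)
            if iou > best_iou:
                best_iou, best_gt = iou, i
        if best_gt is not None and best_iou >= tiou_threshold:
            matched.add(best_gt)
            true_positives += 1
    return true_positives


if __name__ == "__main__":
    dets = [(10.0, 30.0, 0.9), (35.0, 50.0, 0.7), (80.0, 90.0, 0.4)]
    gts = [(12.0, 28.0), (36.0, 49.0)]
    for thr in (0.5, 0.75, 0.95):
        print(thr, match_detections(dets, gts, thr))
```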