2019
DOI: 10.1109/ACCESS.2019.2953113

SAST: Learning Semantic Action-Aware Spatial-Temporal Features for Efficient Action Recognition

Abstract: State-of-the-art action recognition methods suffer from three challenges: (1) how to model the spatial transformations of an action, which undergoes geometric variation over time in videos; (2) how to extract semantic action-aware temporal features from a video in which a large proportion of frames are irrelevant to the labeled action class and hurt the final performance; (3) the recognition speed of most existing models, which is too slow for real-world scenes. In this paper, to address these thre…

Cited by 9 publications (7 citation statements)
References 39 publications
“…It requires more computational time than a single feature, both for extracting features and for classifying the activity. On the other hand, the proposed approach outperforms the various DL methods [36,37,38,39,59,60,61] shown in Table 2. The major challenge with DL methods is that they require large amounts of sampled data to classify actions efficiently.…”
Section: Results
Confidence: 88%
“…They have achieved 93.6% and 66.2% accuracy on the UCF101 and HMDB51 datasets, respectively. Wang et al. [39] proposed CNN-based semantic action-aware spatial-temporal features for action recognition, with 71.2%, 45.6%, 95.9%, and 74.8% accuracy on the Kinetics-400, Something-Something-V1, UCF101, and HMDB51 datasets, respectively. Xia and Wen [59] proposed a multi-stream method based on key-frame sampling for HAR.…”
Section: Literature Review
Confidence: 99%
“…TSN [9] is the baseline, followed by some 3D CNN-based methods, including I3D [18], ECO [22], and SAST [48] in the middle, and some 2D CNN-based methods, including TRN [17], TPN [2], TSM [15], MTD²P [49], and CorrNet [12] in the lower part. Comparing the 8-frame results, our network exceeds the other 3D-based methods on V1. As for the 24-frame results, we surpass ECOEn [22] and SAST [48] by 5% and 5.8%, respectively. In the meantime, those 3D-based methods are pre-trained on a very large dataset (e.g.…”
Section: Implementation Details; Results on Something-Something Datasets
Confidence: 99%
“…Since 2D CNNs cannot explicitly learn the temporal information in a video, some works adopted sparse temporal sampling strategies [5], [22], [23] or explored temporal dependencies between video frames at multiple time scales [24]-[26]. Compared with 2D CNNs, 3D CNN-based methods [27]-[30] can directly extract spatial-temporal features from videos. Although these works have achieved good performance, the use of 3D CNNs is limited by their parameter count and computational overhead, prompting work on 3D convolutional kernel factorization [31]-[33] to balance accuracy and model cost.…”
Section: Related Work
Confidence: 99%
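
The kernel factorization mentioned in the last excerpt is typically realized by splitting a full 3×3×3 convolution into a 2D spatial convolution followed by a 1D temporal one, cutting parameters and FLOPs relative to the full 3D kernel. Below is a minimal PyTorch sketch of that idea in the R(2+1)D style; the `Conv2Plus1D` name and the `mid_channels` default are assumptions of this sketch, not the specific designs of the works cited as [31]-[33].

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """Factorize a 3x3x3 convolution into a (1,3,3) spatial convolution
    followed by a (3,1,1) temporal convolution (hypothetical sketch)."""
    def __init__(self, in_channels, out_channels, mid_channels=None):
        super().__init__()
        # R(2+1)D chooses the intermediate width so the parameter count
        # roughly matches the full 3D kernel; defaulting to out_channels
        # here is a simplifying assumption of this sketch.
        mid_channels = mid_channels or out_channels
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), bias=False)
        self.bn = nn.BatchNorm3d(mid_channels)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), bias=False)

    def forward(self, x):
        # x: (N, C, T, H, W); both convolutions preserve T, H, W.
        return self.temporal(self.relu(self.bn(self.spatial(x))))

# Usage: a batch of two 8-frame RGB clips at 112x112.
clip = torch.randn(2, 3, 8, 112, 112)
block = Conv2Plus1D(3, 64)
features = block(clip)  # -> (2, 64, 8, 112, 112)
```

The extra nonlinearity between the spatial and temporal convolutions is part of what makes the factorized form competitive with, and sometimes stronger than, the full 3D kernel at lower cost.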