2015 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2015.522

Human Action Recognition Using Factorized Spatio-Temporal Convolutional Networks

Abstract: Human actions in video sequences are three-dimensional (3D) spatio-temporal signals characterizing both the visual appearance and motion dynamics of the involved humans and objects. Inspired by the success of convolutional neural networks (CNNs) for image classification, recent attempts have been made to learn 3D CNNs for recognizing human actions in videos. However, partly due to the high complexity of training 3D convolution kernels and the need for large quantities of training videos, only limited success has…
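The factorization named in the title can be made concrete with a short sketch. The following is a minimal illustration, assuming PyTorch, of replacing a full 3D spatio-temporal convolution with a 2D spatial convolution followed by a 1D temporal convolution; the channel counts, kernel sizes, and usage details are illustrative assumptions rather than the paper's exact architecture.

# Sketch: factorizing a 3D spatio-temporal convolution into a 2D spatial
# convolution followed by a 1D temporal convolution (illustrative only;
# all sizes below are assumptions, not the paper's values).
import torch
import torch.nn as nn

class FactorizedSpatioTemporalConv(nn.Module):
    def __init__(self, in_ch, out_ch, spatial_k=3, temporal_k=3):
        super().__init__()
        # 2D convolution over (H, W), applied identically at every time step.
        self.spatial = nn.Conv3d(
            in_ch, out_ch,
            kernel_size=(1, spatial_k, spatial_k),
            padding=(0, spatial_k // 2, spatial_k // 2))
        # 1D convolution over the temporal axis only.
        self.temporal = nn.Conv3d(
            out_ch, out_ch,
            kernel_size=(temporal_k, 1, 1),
            padding=(temporal_k // 2, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.relu(self.temporal(self.relu(self.spatial(x))))

# Usage: a batch of two 16-frame RGB clips at 112x112.
clip = torch.randn(2, 3, 16, 112, 112)
out = FactorizedSpatioTemporalConv(3, 64)(clip)
print(out.shape)  # torch.Size([2, 64, 16, 112, 112])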

Cited by 522 publications (348 citation statements)
References 36 publications
“…Previous research on the use of deep learning for sleep science has been focused on PSG data [45,46]. In other application areas, deep learning has been used for human activity recognition [47,48], which is a similar technical problem. In a previous study, we combined human recognition of actigraphy data with other machine learning algorithms, but not deep learning [49].…”
Section: Discussion
confidence: 99%
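For readers unfamiliar with that neighbouring problem, a minimal sketch of deep-learning-based activity recognition over a wearable/actigraphy-style time series is given below, assuming PyTorch; the window length, channel counts, and number of activity classes are illustrative assumptions, not details taken from the cited studies.

# Sketch: a small 1D CNN classifying a single-channel actigraphy window
# (all sizes below are assumptions for illustration).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=9, padding=4),   # local temporal features
    nn.ReLU(),
    nn.MaxPool1d(4),
    nn.Conv1d(16, 32, kernel_size=9, padding=4),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),                      # pool over the whole window
    nn.Flatten(),
    nn.Linear(32, 5),                             # e.g., 5 activity classes
)

window = torch.randn(8, 1, 300)   # batch of 8 windows, 300 samples each
logits = model(window)            # shape: (8, 5)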
“…In addition to the ResNeXt-50 model, here we also train our model with the deeper ResNeXt-101 [75] and report its performance as well. In order to provide a fair comparison, we split the table into two parts, the ones that incorporate their methods with IDT and the ones that do not:

Method                               UCF101   HMDB51
CNN-hid6 [80]                        79.3     -
Comp-LSTM [62]                       84.3     44.0
C3D+SVM [65]                         85.2     -
2S-CNN [78]                          88.0     59.4
FSTCN [63]                           88.1     59.1
2S-CNN+Pool [78]                     88.2     -
Objects+Motion(R*) [26]              88.5     61.4
2S-CNN+LSTM [78]                     88.6     -
TDD [70]                             90…      …

… [48]                               86.0     60.1
FM+IDT [47]                          87.9     61.1
MIFS+IDT [35]                        89.1     65.1
CNN-hid6+IDT [80]                    89.6     -
C3D Ensemble+IDT (Sports-1M) [65]    90.1     -
C3D+IDT+SVM [65]                     90.4     -
TDD+IDT [70]                         91.5     65.9
Sympathy [9]                         92.5     70.4
Two-Stream Fusion+IDT [15]           93.5     69.2
ST-ResNet+IDT [14]                   94…      …

… [4] has been pre-trained on a large-scale video dataset, Kinetics300k.…”
Section: Dynamic Optical Flow
confidence: 99%
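Several of the strongest rows in the table above are “+IDT” combinations, i.e., CNN predictions combined with an improved-dense-trajectory classifier. A minimal sketch of that kind of score-level late fusion is shown below, assuming per-class scores have already been computed for each stream; the 2:1 weighting and array shapes are illustrative assumptions, not values from the cited papers.

# Sketch: score-level late fusion of two action-recognition streams
# (e.g., a CNN stream and an IDT-based classifier). Scores are assumed
# to be precomputed per-class values for each test clip; the 2:1 weighting
# is an illustrative choice.
import numpy as np

num_clips, num_classes = 4, 101
cnn_scores = np.random.rand(num_clips, num_classes)
idt_scores = np.random.rand(num_clips, num_classes)

fused = 2.0 * cnn_scores + 1.0 * idt_scores    # weighted sum of the streams
predictions = fused.argmax(axis=1)             # predicted class per clip
print(predictions)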
“…There has been a great deal of progress in human activity recognition in video captured from a third-person viewpoint. Early work contributed handcrafted features for feature representation in activity recognition [11,12,13,14,15,16,17,18]. Some studies suggested various methods, such as support vector machines (SVM) [19,20], unsupervised learning [21], and multi-label learning [22], to improve recognition performance.…”
Section: Related Work
confidence: 99%
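As a concrete illustration of the “handcrafted features + SVM” pipeline referred to above, here is a minimal sketch assuming scikit-learn and precomputed feature vectors; the feature dimensionality, class count, and random stand-in data are illustrative assumptions, and the feature-extraction step itself is omitted.

# Sketch: classifying precomputed handcrafted features (e.g., descriptor
# histograms) with an SVM; the data below is random stand-in data.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X = np.random.rand(200, 128)            # 200 clips, 128-D handcrafted features
y = np.random.randint(0, 10, size=200)  # 10 action classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf", C=10.0).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))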