2016
DOI: 10.1007/978-3-319-46478-7_43
Sympathy for the Details: Dense Trajectories and Hybrid Classification Architectures for Action Recognition

Abstract: Action recognition in videos is a challenging task due to the complexity of the spatio-temporal patterns to model and the difficulty to acquire and learn on large quantities of video data. Deep learning, although a breakthrough for image classification and showing promise for videos, has still not clearly superseded action recognition methods using hand-crafted features, even when training on massive datasets. In this paper, we introduce hybrid video classification architectures based on carefully designed uns…

Cited by 30 publications (25 citation statements)
References 52 publications (151 reference statements)
“…However, these approaches still used handcrafted features. With the advent of deep learning, learning representations from data has been extensively studied [14,15,45,58,53,54,25,7,62,56,41,3]. Of these, one of the most popular frameworks has been the approach of Simonyan et al. [45], who introduced the idea of training separate color and optical flow networks to capture local properties of the video.…”
Section: Related Work
confidence: 99%
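The two-stream idea quoted above trains one network on RGB frames and one on optical flow, then combines their class scores. The sketch below shows a minimal late-fusion step under the assumption that each stream outputs per-class logits; the weighting scheme and the toy inputs are illustrative stand-ins, not the original networks from [45].

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_two_stream(rgb_logits, flow_logits, flow_weight=0.5):
    """Late fusion of a spatial (RGB) stream and a temporal (optical-flow)
    stream: each stream's logits become class probabilities, which are
    then combined as a weighted average."""
    p_rgb = softmax(np.asarray(rgb_logits, dtype=float))
    p_flow = softmax(np.asarray(flow_logits, dtype=float))
    fused = (1.0 - flow_weight) * p_rgb + flow_weight * p_flow
    return int(fused.argmax(axis=-1)), fused

# Toy example with 3 action classes: the RGB stream is uncertain,
# while the flow stream strongly favors class 2.
rgb = [1.0, 1.1, 0.9]
flow = [0.1, 0.2, 3.0]
pred, probs = fuse_two_stream(rgb, flow)
```

Averaging probabilities rather than logits keeps each stream's contribution bounded, so one over-confident stream cannot dominate arbitrarily; `flow_weight` lets the temporal stream be up- or down-weighted.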
“…While there has been great progress in classification of objects in still images using convolutional neural networks (CNNs) [19,20,43,47], this has not been the case for action recognition. CNN-based representations [15,51,58,59,63] have not yet significantly outperformed the best hand-engineered descriptors [12,53]. This is partly due to missing large-scale video datasets similar in size and variety to ImageNet [39].…”
Section: Introduction
confidence: 99%
“…In addition to the ResNeXt-50 model, here we also train our model with the deeper ResNeXt-101 [75] and report its performance as well. In order to provide a fair comparison, we split the table into two parts: methods without IDT features and methods that incorporate them.

Method                               UCF101  HMDB51
CNN-hid6 [80]                        79.3    -
Comp-LSTM [62]                       84.3    44.0
C3D+SVM [65]                         85.2    -
2S-CNN [78]                          88.0    59.4
FSTCN [63]                           88.1    59.1
2S-CNN+Pool [78]                     88.2    -
Objects+Motion(R*) [26]              88.5    61.4
2S-CNN+LSTM [78]                     88.6    -
TDD [70]                             90
[48]                                 86.0    60.1
FM+IDT [47]                          87.9    61.1
MIFS+IDT [35]                        89.1    65.1
CNN-hid6+IDT [80]                    89.6    -
C3D Ensemble+IDT (Sports-1M) [65]    90.1    -
C3D+IDT+SVM [65]                     90.4    -
TDD+IDT [70]                         91.5    65.9
Sympathy [9]                         92.5    70.4
Two-Stream Fusion+IDT [15]           93.5    69.2
ST-ResNet+IDT [14]                   94

[4] has been pre-trained on a large-scale video dataset, Kinetics300k.…”
Section: Dynamic Optical Flow
confidence: 99%