2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017
DOI: 10.1109/cvpr.2017.502

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Abstract: The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provid…

Citations: cited by 7,656 publications (6,510 citation statements)
References: 39 publications
“…We outperform the state-of-the-art methods with a significant margin with the exception of I3D+ [4] (98% and 80.7%). Note that this method is pre-trained on additional 300,000 videos and relies on a two-stream variant.…”
Section: State-of-the-art Comparisons (mentioning)
confidence: 92%
“…Furthermore, our four stream models do not improve significantly after the inclusion of improved trajectories (95.5% → 96.0% and 72.5% → 74.9%), showing that the vast majority of the benefit is intrinsic to the proposed architecture. This is interesting, as our four stream models are one of the first models together with I3D [4] which manages to surpass the 95% and 70% barriers on respective UCF101 and HMDB51 without relying on handcrafted features.…”
Section: State-of-the-art Comparisons (mentioning)
confidence: 93%
“…It also outperforms early 3D CNN architecture [9]. [10] and [11] obtain better results, however, these are much more complicated models which were pre-trained on much bigger dataset (Kinetics); consequently, their results are not directly comparable. The proposed model achieves good results without pre-training on bigger datasets, and thus it is better suited for use cases where the limited amount of training data is available.…”
Section: Discussion (mentioning)
confidence: 99%
“…C3D [9] uses relatively shallow CNN architecture, trained from scratch on large video datasets, that is applied to nonoverlapping frame clips with the classification result computed by averaging the scores predicted for all clips. I3D [10] proposes inflating 2D CNNs into 3D and bootstrapping 3D filters from 2D filters, which provides parameter initialization from 2D models trained on ImageNet. [11] explores even deeper ResNet-based architectures similar to those that worked well for image recognition.…”
Section: Introduction (mentioning)
confidence: 99%
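
The last statement above mentions two mechanisms: inflating 2D filters into 3D ones (I3D) and averaging per-clip scores into a video-level result (C3D). A minimal pure-Python sketch of both follows. The function names `inflate_kernel` and `average_clip_scores` are hypothetical, and the 1/depth rescaling is an assumption taken from the usual description of I3D inflation rather than from the quoted text. Real models operate on multi-channel tensors in a deep learning framework; this toy version uses a single-channel kernel as nested lists.

```python
def inflate_kernel(kernel_2d, depth):
    """Repeat a 2D kernel `depth` times along a new time axis.

    Each copy is scaled by 1/depth (assumed rescaling), so the copies
    sum back to the original 2D kernel.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return [[[w / depth for w in row] for row in kernel_2d]
            for _ in range(depth)]


def average_clip_scores(clip_scores):
    """Average per-class scores over clips into one video-level score list."""
    if not clip_scores:
        raise ValueError("need at least one clip")
    n = len(clip_scores)
    # zip(*...) groups the scores of each class across all clips.
    return [sum(per_class) / n for per_class in zip(*clip_scores)]


# Toy 2x2 single-channel kernel, inflated to 4 time steps.
kernel_2d = [[1.0, 2.0], [3.0, 4.0]]
kernel_3d = inflate_kernel(kernel_2d, depth=4)

# Summing the inflated kernel over time recovers the 2D kernel, so a
# video made of one repeated frame yields the same filter response as
# the 2D kernel applied to that single frame.
collapsed = [[sum(kernel_3d[t][i][j] for t in range(4)) for j in range(2)]
             for i in range(2)]

# Two clips, two classes: video-level scores are the per-class means.
video_scores = average_clip_scores([[0.25, 0.75], [0.75, 0.25]])
```

The rescaling is what lets ImageNet-trained 2D weights serve as a sensible starting point: on a static video the inflated filter behaves like the original image filter.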