Abstract—This paper studies the application of modern deep convolutional and recurrent neural networks to video classification, specifically human action recognition. A multistream architecture is proposed that uses representation learning to extract embeddings of multimodal features. It is based on 2D convolutional and recurrent neural networks, and the fusion model receives a video embedding as input; classification is thus performed on a compact representation of spatial, temporal and audio information. The proposed architecture achieves 93.1% accuracy on UCF101, outperforming models with similar architectures, and also produces representations that other models can use as features; anomaly detection with autoencoders is proposed as an example of this.
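As a minimal sketch of the late-fusion idea described above (the embedding dimensions, variable names, and random weights below are illustrative assumptions, not taken from the paper), per-stream embeddings can be concatenated into one compact video embedding and fed to a fusion classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-clip stream embeddings (sizes are assumptions):
# a spatial embedding from a 2D CNN, a temporal embedding from an RNN
# over frame features, and an audio embedding.
spatial = rng.standard_normal(512)
temporal = rng.standard_normal(256)
audio = rng.standard_normal(128)

# Fusion: concatenate the streams into a single compact video embedding.
video_embedding = np.concatenate([spatial, temporal, audio])  # shape (896,)

# A linear classifier over the fused embedding stands in for the trained
# fusion model; weights here are random for illustration only.
num_classes = 101  # UCF101 has 101 action classes
W = rng.standard_normal((num_classes, video_embedding.size)) * 0.01
logits = W @ video_embedding

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

probs = softmax(logits)
predicted_class = int(np.argmax(probs))
print(video_embedding.shape, probs.shape, predicted_class)
```

The same `video_embedding` vector is what downstream models would consume as a feature, e.g. an autoencoder trained on embeddings of normal videos for anomaly detection.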