Sympathy for the Details: Dense Trajectories and Hybrid Classification Architectures for Action Recognition

Souza, César Roberto de; Gaidon, Adrien; Vig, Eleonora; López, Antonio

doi:10.1007/978-3-319-46478-7_43

Cited by 30 publications

(25 citation statements)

References 52 publications

(151 reference statements)


“…However, these approaches still used handcrafted features. With the advent of deep learning, learning representations from data has been extensively studied [14,15,45,58,53,54,25,7,62,56,41,3]. Of these, one of the most popular frameworks has been the approach of Simonyan et al. [45], who introduced the idea of training separate color and optical flow networks to capture local properties of the video.…”

Section: Related Work (mentioning)

confidence: 99%

Asynchronous Temporal Fields for Action Recognition

Sigurdsson, Divvala, Farhadi, et al. 2017

2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)



Actions are more than just movements and trajectories: we cook to eat and we hold a cup to drink from it. A thorough understanding of videos requires going beyond appearance modeling and necessitates reasoning about the sequence of activities, as well as the higher-level constructs such as intentions. But how do we model and reason about these? We propose a fully-connected temporal CRF model for reasoning over various aspects of activities that includes objects, actions, and intentions, where the potentials are predicted by a deep network. End-to-end training of such structured models is a challenging endeavor: For inference and learning we need to construct mini-batches consisting of whole videos, leading to mini-batches with only a few videos. This causes high-correlation between data points leading to breakdown of the backprop algorithm. To address this challenge, we present an asynchronous variational inference method that allows efficient end-to-end training. Our method achieves a classification mAP of 22.4% on the Charades [43] benchmark, outperforming the state-of-the-art (17.2% mAP), and offers equal gains on the task of temporal localization.



“…While there has been great progress in classification of objects in still images using convolutional neural networks (CNNs) [19,20,43,47], this has not been the case for action recognition. CNN-based representations [15,51,58,59,63] have not yet significantly outperformed the best hand-engineered descriptors [12,53]. This is partly due to missing large-scale video datasets similar in size and variety to ImageNet [39].…”

Section: Introduction (mentioning)

confidence: 99%

ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification

Girdhar, Ramanan, Gupta, et al. 2017

2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)


In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks [42] with learnable spatio-temporal feature aggregation [6]. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and combining signals from the different streams. We find that: (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representations. Finally, we show that our representation outperforms the two-stream base architecture by a large margin (13% relative) as well as outperforms other baselines with comparable base architectures on HMDB51, UCF101, and Charades video classification benchmarks.


“…In addition to ResNeXt-50 model, here we also train our model with the deeper ResNeXt-101 [75] and report its performance as well. In order to provide a fair comparison, we split the table into two parts, the ones incorporate their methods … [4] has been pre-trained on a large-scale video dataset, Kinetics300k.…”

Method | UCF101 | HMDB51
CNN-hid6 [80] | 79.3 | -
Comp-LSTM [62] | 84.3 | 44.0
C3D+SVM [65] | 85.2 | -
2S-CNN [78] | 88.0 | 59.4
FSTCN [63] | 88.1 | 59.1
2S-CNN+Pool [78] | 88.2 | -
Objects+Motion(R*) [26] | 88.5 | 61.4
2S-CNN+LSTM [78] | 88.6 | -
TDD [70] | 90 … | …
… [48] | 86.0 | 60.1
FM+IDT [47] | 87.9 | 61.1
MIFS+IDT [35] | 89.1 | 65.1
CNN-hid6+IDT [80] | 89.6 | -
C3D Ensemble+IDT (Sports-1M) [65] | 90.1 | -
C3D+IDT+SVM [65] | 90.4 | -
TDD+IDT [70] | 91.5 | 65.9
Sympathy [9] | 92.5 | 70.4
Two-Stream Fusion+IDT [15] | 93.5 | 69.2
ST-ResNet+IDT [14] | 94 … | …

Section: Dynamic Optical Flow (mentioning)

confidence: 99%

Action Recognition with Dynamic Image Networks

Bilen, Fernando, Gavves, et al. 2018

IEEE Trans. Pattern Anal. Mach. Intell.


We introduce the concept of dynamic image, a novel compact representation of videos useful for video analysis, particularly in combination with convolutional neural networks (CNNs). A dynamic image encodes temporal data such as RGB or optical flow videos by using the concept of 'rank pooling'. The idea is to learn a ranking machine that captures the temporal evolution of the data and to use the parameters of the latter as a representation. When a linear ranking machine is used, the resulting representation is in the form of an image, which we call dynamic because it summarizes the video dynamics in addition of appearance. This is a powerful idea because it allows to convert any video to an image so that existing CNN models pre-trained for the analysis of still images can be immediately extended to videos. We also present an efficient and effective approximate rank pooling operator, accelerating standard rank pooling algorithms by orders of magnitude, and formulate that as a CNN layer. This new layer allows generalizing dynamic images to dynamic feature maps. We demonstrate the power of the new representations on standard benchmarks in action recognition achieving state-of-the-art performance.

