3D R Transform on Spatio-temporal Interest Points for Action Recognition

Yuan, Chunfeng; Li, Xi; Hu, Weiming; Ling, Haibin; Maybank, Stephen J.

doi:10.1109/cvpr.2013.99

Cited by 70 publications

(31 citation statements)

References 24 publications


“…If motion information is available, both of the above two types of representation could be extended to their 3D versions by modeling the input sequences as a tensor, as in dense trajectory [20,4], action bank [21], among others [22][23][24]. These methods are related to our method but is unfortunately beyond the scope of the current work.…”

Section: Related Work (mentioning)

confidence: 99%

Action Recognition from a Single Web Image Based on an Ensemble of Pose Experts

Zhang

Tan

Jin

2015

Computer Vision – ACCV 2014


Abstract. In this paper, we present a new method which estimates the pose of a human body and identifies its action from one single static image. This is a challenging task due to the high degrees of freedom of body poses and lack of any motion cues. Specifically, we build a pool of pose experts, each of which individually models a particular type of articulation for a group of human bodies with similar poses or semantics (actions). We investigate two ways to construct these pose experts and show that this method leads to improved pose estimation performance under difficult conditions. Furthermore, in contrast to previous wisdoms of combining the output of each pose expert for action recognition using such method as majority voting, we propose a flexible strategy which adaptively integrates them in a discriminative framework, allowing each pose expert to adjust their roles in action prediction according to their specificity when facing different action types. In particular, the spatial relationship between estimated part locations from each expert is encoded in a graph structure, capturing both the non-local and local spatial correlation of the body shape. Each graph is then treated as a separate group, on which an overall group sparse constraint is imposed to train the prediction model, with extra weight added according to the confidence of the corresponding expert. We show in our experiments on a challenging web data set with state of the art results that our method effectively improves the tolerance of our system to imperfect pose estimation.


“…This segmentation stage involves all the problematic issues concerning illumination changes, shades, noise... In [23], authors capture the geometrical distribution of interest points extending the R transform to 3D. Our method is able to segment human actions from a video sequence with no need of a previous shape or silhouette extraction.…”

Section: Introduction (mentioning)

confidence: 99%

Temporal segmentation of human actions in video sequences

Carmona

Climent

2017

2017 Intelligent Systems Conference (IntelliSys)


Abstract: Most of the published works concerning action recognition usually assume that the action sequences have been previously segmented in time, that is, the action to be recognized starts with the first sequence frame and ends with the last one. However, temporal segmentation of actions in sequences is not an easy task, and is always prone to errors. In this paper we present a new technique to automatically extract human actions from a video sequence. Our approach presents several contributions. First of all, we use a projection template scheme and find spatio-temporal features and descriptors within the projected surface, rather than extracting them in the whole sequence. For projecting the sequence we use a variant of the R transform, which has never been used before for temporal action segmentation. Instead of projecting the original video sequence, we project its optical flow components, preserving important information about action motion. We test our method on a publicly available action dataset, and the results show that it performs very well segmenting human actions compared with the state-of-the-art methods.


“…However, the 3D R transform is little utilized. We deduce the form and properties of the 3D R transform, based on the 3D discrete Radon transform, and apply the 3D R transform to the representation of spatio-temporal interest points for the task of action recognition [53]. Afterwards, we apply (2D)²PCA [57] to the R transform, to reduce the dimension of the obtained feature.…”

Section: Introduction (mentioning)

confidence: 99%

Fusing $\mathcal{R}$ Features and Local Features with Context-Aware Kernels for Action Recognition

Yuan

et al. 2015

Int J Comput Vis

Self Cite


The performance of action recognition in video sequences depends significantly on the representation of actions and the similarity measurement between the representations. In this paper, we combine two kinds of features extracted from the spatio-temporal interest points with context-aware kernels for action recognition. For the action representation, local cuboid features extracted around interest points are very popular using a Bag of Visual Words (BOVW) model. Such representations, however, ignore potentially valuable information about the global spatio-temporal distribution of interest points. We propose a new global feature to capture the detailed geometrical distribution of interest points. It is calculated by using the 3D R transform which is defined as an extended 3D discrete Radon transform, followed by the application of a two-directional two-dimensional principal component analysis. For the similarity measurement, we model a video set as an optimized probabilistic hypergraph and propose a context-aware kernel to measure high order relationships among videos. The context-aware kernel is more robust to the noise and outliers in the data than the traditional context-free kernel which just considers the pairwise relationships between videos. The hyperedges of the hypergraph are constructed based on a learnt Mahalanobis distance metric. Any disturbing information from other classes is excluded from each hyperedge. Finally, a multiple kernel learning algorithm is designed by integrating the $\ell_2$ norm regularization into a linear SVM classifier to fuse the R feature and the BOVW representation for action recognition. Experimental results on several datasets demonstrate the effectiveness of the proposed approach for action recognition.
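As a rough illustration of the global feature described in this abstract, the sketch below computes an R-transform-like surface from a set of (x, y, t) interest points. It is not the authors' implementation. It assumes that the 3D R transform sums the squared discrete Radon projection over the plane offset ρ for each plane orientation (θ, φ), by analogy with the 2D R transform. The function name `r_transform_3d`, the angular sampling, and the number of ρ bins are hypothetical choices, and the subsequent two-directional 2D PCA step is omitted.

```python
import numpy as np


def r_transform_3d(points, n_theta=8, n_phi=16, n_rho=32):
    """Hypothetical sketch of an R-transform-like feature for a 3D point set.

    Assumption: R(theta, phi) = sum over rho of T(rho, theta, phi)**2, where
    T counts the points lying in the plane slab at offset rho whose unit
    normal has polar angle theta and azimuth phi.
    """
    pts = np.asarray(points, dtype=float)
    pts = pts - pts.mean(axis=0)  # centre the point cloud on its centroid

    # Largest distance from the centroid bounds every projection value.
    radius = np.linalg.norm(pts, axis=1).max()
    if radius == 0.0:
        radius = 1.0
    radius *= 1.0 + 1e-9  # small margin against floating-point overshoot

    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    phis = np.linspace(0.0, 2.0 * np.pi, n_phi, endpoint=False)
    edges = np.linspace(-radius, radius, n_rho + 1)

    R = np.zeros((n_theta, n_phi))
    for i, th in enumerate(thetas):
        for j, ph in enumerate(phis):
            # Unit normal of the family of parallel planes.
            normal = np.array([
                np.sin(th) * np.cos(ph),
                np.sin(th) * np.sin(ph),
                np.cos(th),
            ])
            rho = pts @ normal  # signed offset of each point along the normal
            T, _ = np.histogram(rho, bins=edges)  # discrete Radon projection
            R[i, j] = np.sum(T.astype(float) ** 2)  # squared sum over rho
    return R


# Usage with synthetic points standing in for detected interest points.
rng = np.random.default_rng(0)
demo_points = rng.normal(size=(50, 3))
demo_R = r_transform_3d(demo_points)
```

Centring on the centroid makes the result insensitive to where the action occurs in the video volume; any scale normalization or dimensionality reduction would be applied to the resulting 2D array afterwards.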
