2014 IEEE Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/cvpr.2014.339

Cross-View Action Modeling, Learning, and Recognition

Abstract: Existing methods on video-based action recognition are generally view-dependent, i.e., performing recognition from the same views seen in the training data. We present a novel multiview spatio-temporal AND-OR graph (MST-AOG) representation for cross-view action recognition, i.e., the recognition is performed on the video from an unknown and unseen view. As a compositional model, MST-AOG compactly represents the hierarchical combinatorial structures of cross-view actions by explicitly modeling the geometry, app…
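
The abstract describes MST-AOG as a compositional AND-OR graph, where AND nodes typically decompose an action into required parts and OR nodes enumerate alternatives. The paper's actual model jointly handles geometry, appearance, and motion across views; the fragment below is only a minimal sketch of the generic AND-OR scoring idea, and every name in it (Node, leaf_score, the toy observation dictionary) is a hypothetical stand-in rather than the authors' implementation.

```python
# Illustrative sketch only (not the paper's code): a minimal AND-OR graph
# node structure. AND nodes require all children (scores are summed),
# OR nodes select the best alternative (max), and LEAF nodes read a
# score from a toy observation dictionary.
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """A node in a toy AND-OR graph."""
    kind: str                                        # "AND", "OR", or "LEAF"
    children: List["Node"] = field(default_factory=list)
    leaf_score: Optional[Callable[[dict], float]] = None  # LEAF nodes only

    def score(self, observation: dict) -> float:
        if self.kind == "LEAF":
            return self.leaf_score(observation)
        child_scores = [c.score(observation) for c in self.children]
        if self.kind == "AND":
            return sum(child_scores)   # all parts must be explained
        return max(child_scores)       # OR: pick the best alternative


# Hypothetical example: a torso part that is always required (AND),
# combined with two alternative arm appearances (OR).
torso = Node("LEAF", leaf_score=lambda obs: obs.get("torso", 0.0))
arm_a = Node("LEAF", leaf_score=lambda obs: obs.get("arm_raised", 0.0))
arm_b = Node("LEAF", leaf_score=lambda obs: obs.get("arm_forward", 0.0))
action = Node("AND", children=[torso, Node("OR", children=[arm_a, arm_b])])

print(action.score({"torso": 0.9, "arm_raised": 0.2, "arm_forward": 0.7}))  # 1.6
```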

Cited by 488 publications (415 citation statements)
References 22 publications

“…Moreover, the accuracy achieved by a single local spatiotemporal feature or pose feature is lower than their combination in general. In our experiments, the proposed WSF-DS method achieves better recognition accuracy than most of the recently proposed methods based on the ideas of feature fusion using dendrogram and convolutional neural network features, such as MST [52], AOG [24], and P-CNN [53]…”
Section: Comparison With the State-of-the-art
Citation type: mentioning, confidence: 89%
“…The SBU-Kinect-Interaction dataset [132] contains skeleton data of a pair of subjects performing different interaction activities - one person acting and the other reacting. Many other datasets captured using a Kinect v1 camera were also released to the public, including the MSR Daily Activity 3D [130], MSR Action Pairs [122], Online RGBD Action (ORGBD) [116], UTKinect-Action [133], Florence 3D-Action [127], CMU-MAD [113], UTD-MHAD [112], G3D/G3Di [128,114], SPHERE [117], ChaLearn [120], RGB-D Person Re-identification [131], Northwestern-UCLA Multiview Action 3D [115], Multiview 3D Event [123], CDC4CV pose [64], SBU-Kinect-Interaction [132], UCF-Kinect [124], SYSU 3D Human-Object Interaction [109], Multi-View TJU [108], M²I [107], and 3D Iconic Gesture [125] datasets. The complete list of human-skeleton datasets collected using structured-light cameras are presented in Table 2.…”
Section: Datasets Collected By Structured-light Cameras
Citation type: mentioning, confidence: 99%
“…Existing approaches falling in each categories are summarized in detail in Tables 3-6, respectively.
[115] Cross View BoW Body Dict
Wei et al [123] 4D Interaction Conc Lowlv Hand
Ellis et al [124] Latency Trade-off Conc Lowlv Hand
Wang et al [130,138] Actionlet Conc Lowlv Hand
Barbosa et al [131] Soft-biometrics Feature Conc Body Hand
Yun et al [132] Joint-to-Plane Distance Conc Lowlv Hand
Yang and Tian [139], [140] EigenJoints Conc Lowlv Unsup
Chen and Koskela [141] Pairwise Joints Conc Lowlv Hand
Rahmani et al [142] Joint Movement Volumes Stat Lowlv Hand
Luo et al [143] Sparse Coding BoW Lowlv Dict
Jiang et al [144] Hierarchical Skeleton BoW Lowlv Hand
Yao and Li [145] 2.5D Graph Representation BoW Lowlv Hand
Vantigodi and Babu [146] Variance of Joints Stat Lowlv Hand
Zhao et al [147] Motion Templates BoW Lowlv Dict
Yao et al [148] Coupled Recognition Conc Lowlv Hand
Zhang et al [149] Star Skeleton BoW Lowlv Hand
Zou et al [150] Key
[156] Spectral Graph Skeletons Conc Lowlv Hand
Cippitelli et al [157] Key Poses BoW Lowlv Dict…”
Section: Information Modality
Citation type: mentioning, confidence: 99%
“…For each viewpoint we train a codebook with 2000 codewords using the k-means algorithm. In order to reduce the computational cost of clustering, we used another dataset, the Northwestern-UCLA Multiview Action 3D dataset [25], to build universal view-codebooks. Different from the datasets used to evaluate our method, each activity in this dataset consists of many complicated and arbitrary movements.…”
Section: Feature Representation
Citation type: mentioning, confidence: 99%
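
The excerpt above mentions training a 2000-codeword codebook with k-means on an auxiliary dataset and reusing it as a universal view-codebook. For reference, the fragment below is a minimal sketch of that generic bag-of-words recipe, not the cited authors' code: it assumes local descriptors are already extracted, and the function names (build_codebook, encode) and the use of scikit-learn's MiniBatchKMeans are illustrative choices.

```python
# Minimal sketch (assumptions noted above): cluster local descriptors into
# K codewords with k-means, then encode a video as a normalized histogram
# of nearest-codeword assignments.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

K = 2000  # number of codewords, matching the excerpt


def build_codebook(descriptors: np.ndarray) -> MiniBatchKMeans:
    """Cluster local descriptors (shape N x D) into K codewords."""
    kmeans = MiniBatchKMeans(n_clusters=K, batch_size=10_000, random_state=0)
    kmeans.fit(descriptors)
    return kmeans


def encode(descriptors: np.ndarray, codebook: MiniBatchKMeans) -> np.ndarray:
    """Bag-of-words histogram of one video's descriptors over the codebook."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=K).astype(np.float64)
    return hist / max(hist.sum(), 1.0)  # L1-normalize

# Hypothetical usage, with descriptors stacked from the auxiliary training set:
# train_desc = np.vstack(per_video_descriptors)   # (N, D)
# codebook = build_codebook(train_desc)
# video_hist = encode(video_desc, codebook)
```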