Figure 1: (a) We exploit the visual similarity between mocap-generated trajectories (left) and dense trajectories (right) to improve cross-view action recognition. (b) For mocap trajectories, we can easily obtain corresponding features (i.e., descriptors for trajectories that originate from the same 3D point) in two views. We use these pairs of features to learn the transformation function for viewpoint change.
Overview

A view-invariant representation of human motion is crucial for effective action recognition. However, most view-invariant representations require either tracking of body parts or multi-view video data for learning, which is impractical in many real-life scenarios. We describe a view-independent model for human action that is flexible, action-independent, and requires no multi-view video data or additional labelling effort.

We present a novel method for cross-view action recognition. Using a large collection of motion capture data, we synthesize mocap-trajectory features from multiple viewpoints. Features originating from the same 3D surface point correspond, and this allows us to learn a feature transformation function for viewpoint change. Given this function, we can "hallucinate" the action descriptors of a video for different viewing angles. We use these hallucinated examples as additional training data to make our model view-invariant. We demonstrate the effectiveness of our approach in the unsupervised scenario of the INRIA IXMAS dataset.
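To make the augmentation idea concrete, here is a minimal sketch, assuming the learned transformation for a given viewpoint change is summarized as an n-by-n codeword-transfer matrix (a hypothetical `T`, estimated from the mocap correspondence pairs) that maps a bag-of-words histogram in the source view to an estimated histogram in the target view. The matrix form, normalization, and function names are our assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def hallucinate_histogram(hist_src, T):
    """Estimate a target-view bag-of-words histogram from a source-view one
    using a codeword-transfer matrix T (hypothetical representation).

    hist_src : (n,) L1-normalized histogram over the trajectory codebook.
    T        : (n, n) matrix with T[j, i] ~ P(target codeword j | source codeword i),
               estimated from the mocap correspondence pairs.
    """
    hist_tgt = T @ hist_src
    total = hist_tgt.sum()
    return hist_tgt / total if total > 0 else hist_tgt

def augment_training_set(X, y, transfer_matrices):
    """Append hallucinated examples for each relative viewpoint change.

    X : (N, n) source-view histograms, y : (N,) action labels.
    """
    X_aug, y_aug = [X], [y]
    for T in transfer_matrices:          # one matrix per viewpoint change Delta
        X_aug.append(np.stack([hallucinate_histogram(h, T) for h in X]))
        y_aug.append(y)                  # the action label is unchanged by a view shift
    return np.concatenate(X_aug), np.concatenate(y_aug)
```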
Methodology

The approach has three steps:

Generating training data. We adapt the mocap trajectory generation pipeline of Gupta et al. [1], which uses a human model built from cylindrical primitives (see Figure 1(b)). Each limb consists of a collection of points placed on its 3D surface. Given a camera viewpoint, these points are projected under orthographic projection and tracked over L (= 15) consecutive frames to generate trajectory descriptors similar to the dense trajectories of Wang et al. [3]. The resulting displacement vectors form the trajectory features. Given two arbitrary viewpoints, we can find a correspondence between features that originate from the same point on the surface (see Figure 1(b)).

Learning the transformation function. We quantize the mocap trajectory features using a fixed codebook $C$ of size $n$. Given a source camera elevation angle $\theta$ and a relative change in viewpoint $\Delta = (\delta_\theta, \delta_\phi)$, we define the training set $\mathcal{D}^{\Delta}_{\theta} = \{(f_i, g_i)\}_{i=1}^{m}$ as the set of $m$ pairs $(f, g) \in C \times C$ of quantized features that correspond across the two viewpoints.
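For illustration, the sketch below shows one way the data-generation and pair-construction steps could look under orthographic projection, with dense-trajectory-style normalization of the displacement vectors. The camera parameterization, normalization details, and all function and variable names (e.g. `trajectory_descriptors`, `quantize`) are our own assumptions, not the authors' code.

```python
import numpy as np

def orthographic_project(points_3d, elevation, azimuth):
    """Rotate 3D points (..., 3) into a camera at (elevation, azimuth) in radians
    and drop the depth axis, i.e. an orthographic projection."""
    ce, se = np.cos(elevation), np.sin(elevation)
    ca, sa = np.cos(azimuth), np.sin(azimuth)
    R_az = np.array([[ca, -sa, 0.0], [sa, ca, 0.0], [0.0, 0.0, 1.0]])
    R_el = np.array([[1.0, 0.0, 0.0], [0.0, ce, -se], [0.0, se, ce]])
    return (points_3d @ (R_el @ R_az).T)[..., :2]

def trajectory_descriptors(points_3d_seq, elevation, azimuth):
    """Dense-trajectory-style descriptors for every tracked surface point.

    points_3d_seq : (L+1, P, 3) positions of P surface points over L+1
                    consecutive mocap frames (L = 15 in the paper).
    Returns (P, 2L) concatenated displacement vectors, normalized by the
    sum of their magnitudes as in Wang et al. [3].
    """
    pts_2d = orthographic_project(points_3d_seq, elevation, azimuth)   # (L+1, P, 2)
    disp = np.diff(pts_2d, axis=0)                                     # (L, P, 2)
    norm = np.linalg.norm(disp, axis=-1).sum(axis=0)                   # (P,)
    disp = disp / np.maximum(norm[None, :, None], 1e-8)
    return disp.transpose(1, 0, 2).reshape(disp.shape[1], -1)          # (P, 2L)

def quantize(desc, codebook):
    """Assign each descriptor to the index of its nearest codeword in C."""
    d2 = ((desc[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def correspondence_pairs(points_3d_seq, codebook, theta, delta):
    """Quantized pairs (f_i, g_i) for a source view at elevation theta and a
    relative viewpoint change delta = (d_theta, d_phi); the pairs correspond
    because both views observe trajectories of the same 3D surface points."""
    d_theta, d_phi = delta
    f = quantize(trajectory_descriptors(points_3d_seq, theta, 0.0), codebook)
    g = quantize(trajectory_descriptors(points_3d_seq, theta + d_theta, d_phi), codebook)
    return list(zip(f, g))               # the training set for this viewpoint change
```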
Method                     Average accuracy
Ours                       71.7%
nCTE-based matching [1]    67.4%
w/o aug.                   62.1%
Hankelets [2]              56.4%