In this paper, we present a new geometric-temporal representation for visual action recognition based on local spatio-temporal features. First, we propose a modified covariance descriptor under the log-Euclidean Riemannian metric to represent the spatio-temporal cuboids detected in video sequences. Compared with the previously proposed covariance descriptor, our descriptor can be measured and clustered in Euclidean space. Second, to capture geometric-temporal contextual information, we construct a Directional Pyramid Co-occurrence Matrix (DPCM) to describe the spatio-temporal distribution of the vector-quantized local feature descriptors extracted from a video. DPCM characterizes the co-occurrence statistics of local features as well as the spatio-temporal positional relationships among the concurrent features. These statistics provide strong descriptive power for action recognition. To use DPCM for action recognition, we propose a Directional Pyramid Co-occurrence Matching Kernel (DPCMK) to measure the similarity of videos. The proposed method achieves state-of-the-art performance and substantially improves recognition accuracy over bag-of-visual-words (BOVW) models on six public datasets. For example, on the KTH dataset it achieves 98.78% accuracy, while the BOVW approach achieves only 88.06%. On both the Weizmann and UCF CIL datasets, the highest possible accuracy of 100% is achieved.

Index Terms—Covariance cuboid descriptor, log-Euclidean Riemannian metric, spatio-temporal directional pyramid co-occurrence matrix, kernel machine, action recognition
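To illustrate the idea behind the log-Euclidean covariance descriptor mentioned above, the following is a minimal sketch (not the paper's exact pipeline): it assumes a generic per-cuboid feature matrix, and the function name, feature dimensionality, and regularization constant are illustrative choices, not taken from the paper. The key point it shows is that the matrix logarithm maps the covariance matrix into a vector space where ordinary Euclidean distance and clustering (e.g., k-means for visual-word quantization) apply.

```python
import numpy as np

def log_euclidean_covariance_descriptor(features, eps=1e-6):
    """Sketch: map a cuboid's feature vectors to a log-Euclidean
    covariance descriptor (a flat Euclidean vector).

    features : (n_samples, d) array of features extracted from one
    spatio-temporal cuboid (layout is an assumption for this example).
    """
    # Sample covariance of the features (d x d), regularized so it is
    # strictly positive definite.
    C = np.cov(features, rowvar=False) + eps * np.eye(features.shape[1])

    # Matrix logarithm via eigendecomposition: log(C) = U diag(log w) U^T.
    w, U = np.linalg.eigh(C)
    log_C = (U * np.log(w)) @ U.T

    # log(C) lives in a Euclidean vector space, so its upper triangle can be
    # stacked into a vector and compared or clustered with the ordinary
    # Euclidean metric.
    iu = np.triu_indices_from(log_C)
    return log_C[iu]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cuboid_features = rng.normal(size=(200, 7))  # 7-dim features, illustrative
    desc = log_euclidean_covariance_descriptor(cuboid_features)
    print(desc.shape)  # d*(d+1)/2 = 28-dimensional descriptor
```

A common refinement, omitted here for brevity, is to scale the off-diagonal entries of the vectorized log-covariance by sqrt(2) so that the Euclidean distance between descriptors matches the Frobenius norm of the matrix difference.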