Human action can be decomposed into a series of temporally correlated motions. Since the traditional bag-of-words framework based on local features cannot model the global motion evolution of actions, models such as the Recurrent Neural Network (RNN) [15] and VideoDarwin [5] have been explored to capture video-wise temporal information. Inspired by VideoDarwin, in this paper we present a novel hierarchical scheme, called HiVideoDarwin, to learn a better video representation. Specifically, we first use separate ranking machines to learn motion descriptors of local video clips. Then, to model motion evolution, we encode the features obtained in the previous layer with another ranking machine. Compared with VideoDarwin, HiVideoDarwin captures a global and high-level video representation and is robust to large appearance changes. Compared with RNN, HiVideoDarwin also abstracts semantic information in a hierarchical way, yet is fast to compute and easy to interpret. We evaluate the proposed method on two datasets, namely MPII Cooking and ChaLearn. Experimental results show that HiVideoDarwin has distinct advantages over state-of-the-art models. An additional sensitivity analysis reveals that the overall results are largely insensitive to parameter changes.
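To make the two-layer idea concrete, the following is a minimal sketch (not the authors' implementation) of hierarchical rank pooling. It assumes per-frame features have already been extracted and, following the common practice for VideoDarwin-style rank pooling, approximates the ranking machine with a linear SVR regressed onto frame indices over time-varying mean vectors; the names rank_pool, hi_video_darwin, and clip_len are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVR


def rank_pool(features, C=1.0):
    """Learn a clip descriptor with a linear ranking machine (proxy).

    features: (T, D) array of temporally ordered feature vectors.
    Returns the learned weight vector, used as the motion descriptor.
    """
    T = features.shape[0]
    # Time-varying mean smoothing, as used in VideoDarwin.
    smoothed = np.cumsum(features, axis=0) / np.arange(1, T + 1)[:, None]
    smoothed /= np.linalg.norm(smoothed, axis=1, keepdims=True) + 1e-12
    # Regress onto the temporal order; the weights encode motion evolution.
    targets = np.arange(1, T + 1, dtype=float)
    svr = LinearSVR(C=C, fit_intercept=False, max_iter=10000)
    svr.fit(smoothed, targets)
    return svr.coef_


def hi_video_darwin(frame_features, clip_len=20):
    """Two-layer scheme: rank-pool local clips, then rank-pool the
    resulting clip descriptors to obtain the video-level representation."""
    T = frame_features.shape[0]
    clip_descriptors = np.stack([
        rank_pool(frame_features[s:s + clip_len])
        for s in range(0, T - clip_len + 1, clip_len)
    ])                                   # layer 1: clip-level descriptors
    return rank_pool(clip_descriptors)   # layer 2: video-level descriptor
```

In this sketch the first layer captures short-term motion within each clip, while the second layer models how those clip descriptors themselves evolve over the video, which is the hierarchical encoding described above.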