The availability of large-scale motion capture (mocap) data has greatly stimulated research in computer animation, and automatically annotating complex mocap sequences plays an important role in efficient motion analysis. To this end, this study presents an efficient human mocap data annotation approach using multi-view spatiotemporal feature fusion. First, the authors exploit an improved hierarchical aligned cluster analysis algorithm to divide an unknown human mocap sequence into several sub-motion clips, each of which carries a particular semantic meaning. Then, two kinds of multi-view features, namely the most informative central distances and the most informative geometric angles, are discriminatively extracted and temporally modelled by a Fourier temporal pyramid to complementarily characterise each motion clip. Finally, the authors utilise discriminant correlation analysis to fuse these two types of motion features and employ an extreme learning machine to annotate each sub-motion clip. Extensive experiments on a publicly available database demonstrate the effectiveness of the proposed approach in comparison with existing counterparts.
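To make the temporal-modelling step concrete, the Python sketch below illustrates one common formulation of a Fourier temporal pyramid: the per-frame feature sequence (e.g. central distances or geometric angles) is recursively split into segments, and the low-frequency Fourier magnitudes of each segment are concatenated into a fixed-length clip descriptor. This is a minimal sketch under assumed conventions; the function name and parameters (`levels`, `n_coeffs`) are illustrative, not the authors' implementation.

```python
import numpy as np

def fourier_temporal_pyramid(seq, levels=3, n_coeffs=4):
    """Encode a variable-length feature sequence with a Fourier temporal pyramid.

    seq: (T, D) array of per-frame features (e.g. joint distances or angles).
    levels: pyramid depth; level l splits the sequence into 2**l segments.
    n_coeffs: number of low-frequency magnitudes kept per segment and dimension.
    Returns a fixed-length 1-D descriptor, independent of the clip length T.
    """
    seq = np.asarray(seq, dtype=float)
    parts = []
    for level in range(levels):
        n_seg = 2 ** level
        for segment in np.array_split(seq, n_seg, axis=0):
            # Magnitudes of the lowest-frequency DFT coefficients per dimension;
            # pad very short segments so every cell yields the same number of values.
            spec = np.abs(np.fft.rfft(segment, axis=0))
            if spec.shape[0] < n_coeffs:
                spec = np.pad(spec, ((0, n_coeffs - spec.shape[0]), (0, 0)))
            parts.append(spec[:n_coeffs].ravel())
    return np.concatenate(parts)

# Toy usage: a 120-frame clip with 10 per-frame features maps to a descriptor
# of length (1 + 2 + 4) segments * 4 coefficients * 10 dims = 280, regardless
# of the clip's duration.
clip = np.random.randn(120, 10)
print(fourier_temporal_pyramid(clip).shape)  # (280,)
```

Because the descriptor length is fixed, such pyramid features from the two views can then be fused (here via discriminant correlation analysis) and fed to a classifier such as an extreme learning machine.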