Inspired from the promising performances achieved by recurrent neural networks (RNN) and convolutional neural networks (CNN) in action recognition based on skeleton, this paper presents a deep network structure which combines both CNN for classification and RNN to achieve attention mechanism for human interaction recognition. Specifically, the attention module in this structure is utilized to give various levels of attention to various frames by different weights, and the CNN is employed to extract the high-level spatial and temporal information of skeleton data. These two modules seamlessly form a single network architecture. In addition, to eliminate the impact of different locations and orientations, a coordinate transformation is conducted from the original coordinate system to the human-centric coordinate system. Furthermore, three different features are extracted from the skeleton data as the inputs of three subnetworks, respectively. Eventually, these subnetworks fed with different features are fused as an integrated network. The experimental result shows the validity of the proposed approach on two widely used human interaction datasets.