The hand is a non-rigid object whose shape varies widely, which makes gesture recognition difficult. Dynamic gesture recognition ultimately rests on classifying single-frame still images, so this paper focuses on static gesture recognition. Current gesture recognition methods suffer from limited accuracy, insufficient real-time performance, and poor robustness. To address these problems, a Kinect sensor is used to acquire color and depth gesture samples, which are then preprocessed. On this basis, a joint network combining a convolutional neural network (CNN) and restricted Boltzmann machines (RBMs) is proposed for gesture recognition. A stack of multiple RBMs performs unsupervised feature extraction, while the CNN performs supervised feature extraction. Finally, the two sets of features are combined and used for classification. Simulation results show that the proposed joint network performs better on gesture samples with simple backgrounds, while its recognition of gesture samples with complex backgrounds still needs improvement.
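The feature-fusion idea described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. The layer sizes, the single convolution stage, the 16x16 single-channel input, the random (untrained) weights, and all function names (`rbm_stack_features`, `conv_features`, `classify`) are assumptions made for illustration only. Training, which would typically use contrastive divergence for the RBMs and backpropagation for the CNN, is omitted; only the forward pass and the concatenation of the two feature vectors are shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_stack_features(v, layers):
    """Pass a flattened image through stacked RBMs.

    Each layer is a (W, c) pair; the hidden activation probabilities of one
    RBM serve as the visible input of the next.
    """
    h = v
    for W, c in layers:
        h = sigmoid(h @ W + c)
    return h

def conv_features(img, kernels):
    """One 'valid' convolution stage per kernel, then ReLU and 2x2 max pooling.

    As in most deep-learning libraries, the kernel is not flipped
    (cross-correlation).
    """
    k = kernels.shape[1]
    out_h, out_w = img.shape[0] - k + 1, img.shape[1] - k + 1
    feats = []
    for kern in kernels:
        fmap = np.empty((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                fmap[i, j] = np.sum(img[i:i + k, j:j + k] * kern)
        fmap = np.maximum(fmap, 0.0)  # ReLU
        ph, pw = out_h // 2, out_w // 2
        # 2x2 max pooling; any odd trailing row/column is cropped
        pooled = fmap[:ph * 2, :pw * 2].reshape(ph, 2, pw, 2).max(axis=(1, 3))
        feats.append(pooled.ravel())
    return np.concatenate(feats)

def classify(img, rbm_layers, kernels, W_out, b_out):
    """Concatenate RBM and CNN features, then apply a softmax classifier."""
    f_rbm = rbm_stack_features(img.ravel(), rbm_layers)
    f_cnn = conv_features(img, kernels)
    fused = np.concatenate([f_rbm, f_cnn])
    logits = fused @ W_out + b_out
    p = np.exp(logits - logits.max())  # numerically stable softmax
    return p / p.sum()

# Hypothetical sizes: 16x16 input, RBM stack 256 -> 64 -> 32,
# four 3x3 kernels (4 * 7 * 7 = 196 CNN features), five gesture classes.
rng = np.random.default_rng(0)
img = rng.random((16, 16))
rbm_layers = [
    (rng.normal(0, 0.1, (256, 64)), np.zeros(64)),
    (rng.normal(0, 0.1, (64, 32)), np.zeros(32)),
]
kernels = rng.normal(0, 0.1, (4, 3, 3))
W_out = rng.normal(0, 0.1, (32 + 4 * 7 * 7, 5))
b_out = np.zeros(5)

probs = classify(img, rbm_layers, kernels, W_out, b_out)
print(probs.shape)
```

In this sketch the fusion is a plain concatenation of a 32-dimensional RBM feature vector with a 196-dimensional CNN feature vector, so the final classifier sees both the unsupervised and the supervised representation of the same image.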