With the rapid development of sensor technology and artificial intelligence, the video gesture recognition technology under the background of big data makes human-computer interaction more natural and flexible, bringing richer interactive experience to teaching, on-board control, electronic games, etc. In order to perform robust recognition under the conditions of illumination change, background clutter, rapid movement, partial occlusion, an algorithm based on multi-level feature fusion of two-stream convolutional neural network is proposed, which includes three main steps. Firstly, the Kinect sensor obtains RGB-D images to establish a gesture database. At the same time, data enhancement is performed on training and test sets. Then, a model of multi-level feature fusion of twostream convolutional neural network is established and trained. Experiments result show that the proposed network model can robustly track and recognize gestures, and compared with the single-channel model, the average detection accuracy is improved by 1.08%, and mean average precision (mAP) is improved by 3.56%. The average recognition rate of gestures under occlusion and different light intensity was 93.98%. Finally, in the ASL dataset, LaRED dataset, and 1-miohand dataset, recognition accuracy shows satisfactory performances compared to the other method.