In this paper, we use VR equipment to collect facial expression images and normalize the angle, scale, and grayscale of the collected images. Orientation quantization of the image features is realized by 3D gradient computation, and the oriented-gradient histograms of the individual video sub-blocks are then concatenated into the final HOG3D descriptor, completing the extraction of dynamic expression features. To address the high dimensionality of these features, we propose to reduce them with a principal component analysis (PCA) algorithm and to construct the facial expression tracking recognition model jointly from a multi-layer perceptron and a deep belief network. Two datasets are used to analyze real-time facial expression tracking in virtual reality. The results show that the validation accuracy on both datasets A and B reaches its maximum at the 120th iteration, while the loss value reaches equilibrium quickly, by the 40th iteration. The dynamic occluded-expression recognition rate of the deep belief network on dataset A (66.52%) is higher than that of the CNN (62.74%), which demonstrates that the proposed method can effectively improve real-time facial expression tracking performance in virtual reality. This study can help computers further understand human emotions through facial expressions, which is of great significance to the development of the human-computer interaction field.
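The following is a minimal, illustrative sketch of the pipeline described above, not the paper's implementation. It assumes several simplifications: a toy HOG3D-style descriptor that quantizes only the spatial gradient angle (magnitude-weighted by the full spatio-temporal gradient) rather than the full 3D polyhedral orientation quantization of HOG3D, scikit-learn's PCA for dimensionality reduction, and a plain MLP classifier standing in for the joint multi-layer perceptron and deep belief network model. All function names, block sizes, and parameters are hypothetical.

```python
# Hedged sketch of the described pipeline: HOG3D-like features -> PCA -> MLP.
# Simplified stand-ins only; not the paper's actual method or parameters.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

def hog3d_descriptor(video, block=(4, 8, 8), n_bins=8):
    """Concatenate per-sub-block histograms of gradient orientations.

    video: ndarray of shape (T, H, W), grayscale frames.
    NOTE: quantizes only the spatial angle (a simplification of HOG3D's
    3D orientation quantization), weighting votes by 3D gradient magnitude.
    """
    gt, gy, gx = np.gradient(video.astype(np.float64))
    angle = np.arctan2(gy, gx)                       # spatial orientation in [-pi, pi]
    bins = ((angle + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    mag = np.sqrt(gx**2 + gy**2 + gt**2)             # spatio-temporal magnitude
    T, H, W = video.shape
    bt, bh, bw = block
    feats = []
    for t in range(0, T - bt + 1, bt):
        for y in range(0, H - bh + 1, bh):
            for x in range(0, W - bw + 1, bw):
                b = bins[t:t+bt, y:y+bh, x:x+bw].ravel()
                m = mag[t:t+bt, y:y+bh, x:x+bw].ravel()
                hist = np.bincount(b, weights=m, minlength=n_bins)
                feats.append(hist / (np.linalg.norm(hist) + 1e-9))
    return np.concatenate(feats)                     # cascaded final descriptor

# Toy random "videos" of 2 expression classes, for shape-checking only.
rng = np.random.default_rng(0)
videos = rng.random((40, 16, 32, 32))
labels = rng.integers(0, 2, size=40)

X = np.stack([hog3d_descriptor(v) for v in videos])
X_low = PCA(n_components=20).fit_transform(X)        # PCA dimensionality reduction
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
clf.fit(X_low, labels)
print("train accuracy:", clf.score(X_low, labels))
```

In this sketch, PCA keeps the classifier's input small and decorrelated, which mirrors the motivation stated above for reducing the dimensionality of the cascaded HOG3D features before they are fed to the recognition model.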