Deep learning-based models have achieved promising performance in video behavior anomaly detection, but they incur high computational cost when processing video data. Moreover, their performance is severely restricted by challenges such as lighting conditions, background noise, and occlusion. The human skeleton has a compact spatial structure and rich semantic information, which makes skeleton-based representations more robust to such defects in video data. Although skeleton-based methods offer good real-time performance, their recognition accuracy still needs improvement. To this end, we propose a novel discrete variational feature transfer learning (DVFTL) framework in which a spatiotemporal graph-embedded Transformer module is used to build the feature extraction backbones. Specifically, we extract human skeleton information from video frames and combine graph convolution with the Transformer encoder to explore the local and global dependencies among the joints of the human skeleton. To model the uncertainty of abnormal behavior, we construct the extraction network on a discrete variational reconstruction mechanism. To further improve skeleton-based detection performance, we introduce a distillation transfer learning mechanism from the video variational network to the skeleton variational network. We conduct extensive comparative experiments on two publicly available datasets. The experimental results show that the skeleton-based variational network achieves high performance and that the transfer learning mechanism further improves the performance of the proposed model.
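The distillation transfer from the video network (teacher) to the skeleton network (student) can be illustrated with a generic temperature-softened distillation loss. This is a minimal sketch under stated assumptions, not the DVFTL objective itself: the abstract does not specify the loss, so the KL form, the `temperature` parameter, and the function names `log_softmax` and `distillation_loss` are hypothetical choices made for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log of the temperature-softened softmax, computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max before exponentiating
    log_sum = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum for x in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened categorical distributions.

    Assumption: both networks emit logits over the same discrete set
    (e.g. the same categorical latent codes). The T^2 factor is the
    usual scaling in standard knowledge distillation.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (video)
    log_q = log_softmax(student_logits, temperature)  # student (skeleton)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


# Hypothetical logits for a single sample, for illustration only.
video_logits = [2.0, 0.5, -1.0]
skeleton_logits = [0.3, 0.2, 0.1]
loss = distillation_loss(video_logits, skeleton_logits)
```

The loss is zero when the student reproduces the teacher's distribution and positive otherwise, so minimizing it pulls the skeleton network's discrete distribution toward that of the video network. In practice such a term would be added to the student's own reconstruction objective with a weighting factor, which is likewise an assumption here.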