Automatic generation of Labanotation is a major research topic in computer vision and has attracted the attention of many researchers. Various methods have been developed to support dance education and preservation, but existing methods do not model the spatial-temporal dependencies of dance movements and therefore represent complex movements suboptimally. We propose a network for automatic Labanotation generation based on long- and short-range spatial-temporal relations. It consists of a local spatial-temporal feature extraction network, a global spatial-temporal feature extraction network, and a local-global feature fusion network, and it aligns the input and output sequences while modeling skeletal spatial-temporal relations. Local spatial-temporal features, which capture relationships within short-interval skeleton sequences, are obtained through multi-scale convolution over time and space. Global spatial-temporal features, which capture relationships among skeleton sequences separated by long intervals, are learned by a transformer network. To combine the outputs of the two networks, we use a pyramid squeeze attention network to exchange information between the long- and short-range spatial-temporal features, so that the two complement each other and improve the accuracy of action recognition. Experimental results show that the proposed method outperforms state-of-the-art methods on laban16 and laban48, two common datasets in Labanotation research.
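
The three-part structure described above (a local branch, a global branch, and a fusion step) can be illustrated with a minimal, dependency-free sketch. This is not the proposed network. It assumes uniform averaging kernels in place of learned multi-scale convolutions, a single self-attention step with no learned projections in place of the transformer, and a simple channel-wise softmax gate in place of pyramid squeeze attention. All function names are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def temporal_conv(seq, kernel_size):
    # Temporal convolution with a uniform kernel, truncated at the edges.
    # Assumption: stands in for a learned convolution kernel.
    T, C = len(seq), len(seq[0])
    half = kernel_size // 2
    out = []
    for t in range(T):
        window = seq[max(0, t - half):min(T, t + half + 1)]
        out.append([sum(f[c] for f in window) / len(window) for c in range(C)])
    return out

def local_branch(seq, kernel_sizes=(3, 5)):
    # Multi-scale: average the outputs of several temporal window sizes.
    outs = [temporal_conv(seq, k) for k in kernel_sizes]
    T, C = len(seq), len(seq[0])
    return [[sum(o[t][c] for o in outs) / len(outs) for c in range(C)]
            for t in range(T)]

def global_branch(seq):
    # Scaled dot-product self-attention with Q = K = V = seq, so every
    # frame can attend to every other frame regardless of distance.
    C = len(seq[0])
    scale = math.sqrt(C)
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, seq)) for c in range(C)])
    return out

def fuse(local, global_):
    # Squeeze each branch to one value per channel (mean over time), then
    # gate the two branches with a per-channel softmax.
    # Assumption: a simplified stand-in for pyramid squeeze attention.
    T, C = len(local), len(local[0])
    fused = [[0.0] * C for _ in range(T)]
    for c in range(C):
        sq_local = sum(f[c] for f in local) / T
        sq_global = sum(f[c] for f in global_) / T
        w_local, w_global = softmax([sq_local, sq_global])
        for t in range(T):
            fused[t][c] = w_local * local[t][c] + w_global * global_[t][c]
    return fused

def extract_features(seq):
    # seq: list of T frames, each a list of C skeleton feature values.
    return fuse(local_branch(seq), global_branch(seq))

frames = [[float(t), 1.0] for t in range(6)]  # toy 6-frame, 2-channel input
features = extract_features(frames)
```

The output keeps the input's shape of one feature vector per frame, so a downstream classifier or sequence decoder could consume it frame by frame.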