“…Compared with the conventional RGB video, 3D skeleton owning high-level representation is light-weight and robust to both view differences and complicated background. Therefore, 3D skeleton based action recognition has been widely investigated with methods based on handcrafted features (Weng et al, 2017;Xia et al, 2012), Convolutional Neural Networks (CNNs) (Ke et al, 2017a,b;Li et al, 2017;Hou et al, 2018;Li et al, 2019a;Xu et al, 2018;, Recurrent Neural Networks (RNNs) (Li et al, 2018(Li et al, , 2019bLiu et al, 2018;Song et al, 2017) and Graph Convolutional Networks (GCNs) (Yan et al, 2018;Shi et al, 2019b;Ye et al, 2020;Zhang et al, 2020a;Shi et al, 2019a;Kong et al, 2022;Gao et al, 2021;Liu et al, 2022;Peng et al, 2021). However, these methods are developed in a fully supervised manner and require extensive annotated labels, which is expensive and time-consuming.…”