Automatic Sign Language Recognition (SLR) is a vital area of Human-Computer Interaction (HCI) and computer vision: it converts the hand signs used by people with significant hearing and speech impairments into equivalent text or voice. Researchers have recently used hand skeleton joint information instead of image pixels, because pixel-based approaches are sensitive to illumination changes and complex backgrounds. However, besides hand information, body motion and facial gestures play an essential role in expressing emotion in sign language. A few researchers have developed SLR systems on multi-gesture datasets, but their accuracy and time complexity remain unsatisfactory. In light of these limitations, we introduce a Spatial and Temporal Attention Model combined with a General Neural Network for SLR. The main idea of our architecture is first to construct a fully connected graph onto which the skeleton information is projected. We then employ self-attention mechanisms to extract information from node and edge features across the spatial and temporal domains. Our architecture comprises three branches: a graph-based spatial branch, a graph-based temporal branch, and a general neural network branch, whose outputs are combined in the final feature integration. Specifically, the spatial branch captures spatial dependencies, while the temporal branch strengthens the temporal dependencies embedded within the sequential hand skeleton data. Further, the general neural network branch enhances the architecture’s generalization capability and thereby its robustness. We evaluated the model on the Mexican Sign Language (MSL) and Pakistani Sign Language (PSL) datasets and on the American Sign Language Lexicon Video Dataset (ASLLVD), which comprises 3D joint coordinates for the face, body, and hands, and conducted experiments on individual gestures and their combinations.
Our model achieved an accuracy of 99.96% on the MSL dataset, 92.00% on PSL, and 26.00% on the ASLLVD dataset, which includes 2745 classes. These results, together with the model’s computational efficiency, demonstrate its superiority over contemporary methods in the field.
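The three-branch idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`self_attention`, `three_branch_features`, `init_params`), the random weights, the mean-pooling fusion, and the single dense layer standing in for the general neural network branch are all assumptions made for illustration, and edge features are omitted. Attention over all joint pairs plays the role of the fully connected graph.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over the second-to-last axis.

    x has shape (..., N, C); every one of the N nodes attends to every
    other node, i.e. the graph is fully connected. Returns (..., N, D).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def init_params(num_joints, channels, dim, seed=0):
    # Hypothetical random weights; a real model would learn these.
    rng = np.random.default_rng(seed)
    proj = lambda: rng.normal(scale=0.1, size=(channels, dim))
    return {
        "spatial": [proj() for _ in range(3)],
        "temporal": [proj() for _ in range(3)],
        "dense": rng.normal(scale=0.1, size=(num_joints * channels, dim)),
    }

def three_branch_features(skeleton, params):
    """skeleton: (T, J, C) array of T frames, J joints, C coordinates."""
    num_frames = skeleton.shape[0]

    # Spatial branch: joints attend to each other within each frame.
    spatial = self_attention(skeleton, *params["spatial"])      # (T, J, D)

    # Temporal branch: each joint attends to itself across frames.
    per_joint = np.swapaxes(skeleton, 0, 1)                     # (J, T, C)
    temporal = self_attention(per_joint, *params["temporal"])   # (J, T, D)

    # General branch (assumed form): one dense layer with ReLU per frame.
    flat = skeleton.reshape(num_frames, -1)                     # (T, J*C)
    general = np.maximum(flat @ params["dense"], 0.0)           # (T, D)

    # Fusion (assumed form): mean-pool each branch, then concatenate.
    return np.concatenate([
        spatial.mean(axis=(0, 1)),
        temporal.mean(axis=(0, 1)),
        general.mean(axis=0),
    ])                                                          # (3*D,)
```

For a sequence of 10 frames with 21 joints and 3 coordinates each, calling `three_branch_features(x, init_params(21, 3, 8))` yields a single fused vector of length 24 that a classifier head could consume.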