Human activity recognition (HAR) has emerged as a significant area of research owing to its many potential applications, including ambient assisted living, healthcare, and abnormal behaviour detection. Recently, HAR using WiFi channel state information (CSI) has become a prominent approach in indoor environments compared to alternatives (e.g., sensor- and vision-based methods) because it preserves privacy, eliminates the need to carry additional devices, and can capture motion in both line-of-sight (LOS) and non-line-of-sight (NLOS) settings. Existing deep learning (DL)-based HAR approaches usually extract either temporal or spatial features and lack adequate means to integrate and use the two simultaneously, which makes it challenging to recognize different activities accurately. Motivated by this, we propose a novel DL-based model named spatio-temporal convolution with nested long short-term memory (STC-NLSTMNet), which extracts spatial and temporal features concurrently and automatically recognizes human activity with high accuracy. The proposed STC-NLSTMNet model mainly comprises depthwise separable convolution (DS-Conv) blocks, a feature attention module (FAM), and a nested long short-term memory (NLSTM) network. The DS-Conv blocks extract spatial features from the CSI signal, and the FAM directs attention to the most essential of these features. The resulting robust features are then fed into the NLSTM to explore the hidden intrinsic temporal features in CSI signals. The proposed STC-NLSTMNet model is evaluated on two publicly available datasets: Multi-environment and StanWiFi. The experimental results show that the STC-NLSTMNet model achieves activity recognition accuracies of 98.20% and 99.88% on the Multi-environment and StanWiFi datasets, respectively.
Compared with other existing approaches, the proposed STC-NLSTMNet model improves activity recognition accuracy by 4% and 1.88% on the Multi-environment and StanWiFi datasets, respectively, over the best existing method.
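
To make the DS-Conv building block concrete, the following is a minimal pure-Python sketch of a generic one-dimensional depthwise separable convolution: a per-channel (depthwise) filter followed by a 1x1 (pointwise) channel mix. It is an illustration of the general technique only, not the STC-NLSTMNet implementation. The function names, the toy input, and the layer sizes used in `param_counts` (30 input channels, 64 output channels, kernel size 3) are hypothetical choices made for this example, and biases, padding, strides, the FAM, and the NLSTM are all omitted.

```python
def depthwise_conv1d(x, kernels):
    """Filter each channel with its own kernel ('valid' cross-correlation).

    x: list of C channels, each a list of T samples.
    kernels: list of C kernels, one per channel.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        out.append([
            sum(channel[t + i] * kernel[i] for i in range(k))
            for t in range(len(channel) - k + 1)
        ])
    return out


def pointwise_conv(x, weights):
    """Mix channels at every time step with a 1x1 convolution.

    weights: list of C_out rows, each holding C_in mixing weights.
    """
    steps = len(x[0])
    return [
        [sum(row[c] * x[c][t] for c in range(len(x))) for t in range(steps)]
        for row in weights
    ]


def ds_conv1d(x, depthwise_kernels, pointwise_weights):
    """Depthwise separable convolution: depthwise step, then pointwise step."""
    return pointwise_conv(depthwise_conv1d(x, depthwise_kernels), pointwise_weights)


def param_counts(c_in, c_out, k):
    """Weight counts (no biases) for a standard vs. a separable 1D convolution."""
    standard = c_in * c_out * k
    separable = c_in * k + c_in * c_out
    return standard, separable


# Toy input: 2 channels (e.g. two CSI subcarriers), 4 time steps. Hypothetical values.
x = [[1, 2, 3, 4], [0, 1, 0, 1]]
depthwise_kernels = [[1, 1], [1, -1]]   # one length-2 kernel per input channel
pointwise_weights = [[1, 1], [2, 0]]    # 2 output channels mixing 2 input channels

features = ds_conv1d(x, depthwise_kernels, pointwise_weights)
standard, separable = param_counts(c_in=30, c_out=64, k=3)
```

For the hypothetical layer sizes above, the separable form uses 2010 weights where a standard convolution would use 5760, which illustrates why depthwise separable blocks are commonly chosen to reduce model size.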