<span lang="EN-US">Human action recognition (HAR) has attracted significant attention recently because of high demand across application domains. In addition to the spatial correlation present in 2D images, actions in video also carry temporal attributes. Many studies have addressed simple activity recognition, but recognizing complex activities remains challenging. We explore different CNN architectures for extracting spatio-temporal features. The proposed work applies transfer learning with two pre-trained networks, ResNet50 and VGG16, to extract spatio-temporal features, which are then fused by averaging. The frames fed to the pre-trained networks are first pre-processed; pre-processing consists of frame extraction, resizing, and contrast enhancement. A multiclass non-linear SVM classifier trained on the fused features achieves an accuracy of 89.71% on the UCF-101 dataset.</span>
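
The averaging fusion step can be illustrated with a minimal sketch. It assumes that the two feature vectors for a frame have already been brought to the same length; the abstract does not state how the ResNet50 and VGG16 outputs are matched in dimension, so that step is left out. The function name `fuse_by_averaging` and the toy values are hypothetical and are not taken from the proposed work.

```python
from typing import List, Sequence


def fuse_by_averaging(feat_a: Sequence[float], feat_b: Sequence[float]) -> List[float]:
    """Fuse two feature vectors by taking their element-wise mean.

    Assumption: both vectors have the same length. How the two networks'
    outputs are aligned in dimension is not specified in the source.
    """
    if len(feat_a) != len(feat_b):
        raise ValueError("feature vectors must have the same length")
    return [(a + b) / 2.0 for a, b in zip(feat_a, feat_b)]


# Toy stand-ins for the per-frame features from the two networks.
resnet_like = [1.0, 2.0, 3.0]
vgg_like = [3.0, 4.0, 5.0]

fused = fuse_by_averaging(resnet_like, vgg_like)
print(fused)  # [2.0, 3.0, 4.0]
```

In a full pipeline, fused vectors of this kind would form the rows of the training matrix passed to the multiclass SVM classifier.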