Object detection (OD) is a computer vision procedure for locating objects in digital images. Our study examines the crucial need for robust OD algorithms in human activity recognition, a vital domain spanning human-computer interaction, sports analysis, and surveillance. Nowadays, three-dimensional convolutional neural networks (3DCNNs) are a standard method for recognizing human activity. Utilizing recent advances in Deep Learning (DL), we present a novel framework designed to create a fusion model that enhances conventional methods at integrates three-dimensional convolutional neural networks (3DCNNs) with Convolutional Long-Short-Term Memory (ConvLSTM) layers. Our proposed model focuses on utilizing the spatiotemporal features innately present in video streams. An important aspect often missed in existing OD methods. We assess the efficacy of our proposed architecture employing the UCF-50 dataset, which is well-known for its different range of human activities. In addition to designing a novel deep-learning architecture, we used data augmentation techniques that expand the dataset, improve model robustness, reduce overfitting, extend dataset size, and enhance performance on imbalanced data. The proposed model demonstrated outstanding performance through comprehensive experimentation, achieving an impressive accuracy of 98.11% in classifying human activity. Furthermore, when benchmarked against state-of-the-art methods, our system provides adequate accuracy and class average for 50 activity categories.