In the realm of senior care, falls remain a leading cause of hospitalization and mortality. Early detection enables timely medical intervention and mitigates the associated risks, so it is worthwhile to set up a mechanism that identifies and tracks such incidents. Wearable sensors are commonly used for this purpose, but they may not be practical for daily use; surveillance cameras, in contrast, provide an unobtrusive and convenient monitoring solution. Despite the intricate nature of video frames, deep-learning-based person recognition can enhance the performance of such systems. In our approach, activity classification is triggered only when a person is detected in the scene. We then process the extracted Global History of Motion (GHM) images with our Lightweight Deep Neural Network (LDNN). In parallel, a dilated Convolutional Long Short-Term Memory (ConvLSTM) network coupled with our LDNN analyzes the extracted sequence of monocular depth frames, and the final decision combines the information from both streams. Empirical results show that our fall detection system performs competitively with the state-of-the-art on the UP-Fall and UR datasets. For person classification, the Xception network achieved an F1-score of 98.46% on a sub-dataset derived from the COCO dataset and 100% on the UP-Fall dataset.
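The combined decision over the GHM stream and the depth/ConvLSTM stream can be illustrated as score-level (late) fusion. The sketch below is a minimal, hypothetical example assuming each stream outputs per-class probabilities and the fusion is a weighted average; the function name, weight, and class order are illustrative, not the paper's exact rule.

```python
def fuse_streams(p_ghm, p_depth, w_ghm=0.5):
    """Weighted average of per-class probabilities from the
    GHM+LDNN stream and the depth+dilated-ConvLSTM stream.
    Illustrative only: the actual fusion rule may differ."""
    fused = [w_ghm * a + (1.0 - w_ghm) * b for a, b in zip(p_ghm, p_depth)]
    total = sum(fused)
    return [x / total for x in fused]  # renormalize to a distribution

# Hypothetical class order: [no_fall, fall]
p_ghm = [0.30, 0.70]    # scores from the GHM + LDNN stream
p_depth = [0.20, 0.80]  # scores from the depth + ConvLSTM stream
fused = fuse_streams(p_ghm, p_depth)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(predicted)  # index of the predicted class
```

With equal weights, the fused distribution here is roughly [0.25, 0.75], so the "fall" class is selected; in practice the stream weight would be tuned on validation data.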