Human actions can now be recorded in public places such as airports, shopping malls, and educational institutions to monitor suspicious activities such as terrorism, fighting, theft, and vandalism. Surveillance videos contain adequate visual and motion information about the events that occur within a camera’s view. Our study builds on the concept that an action is a sequence of moving body parts. In this paper, we propose a new descriptor that formulates human poses, tracks the relative motion of body parts across video frames, and extracts the position and orientation of each body part. We use Part Affinity Fields (PAFs) to obtain the body parts of each person present in a frame, together with the associations between those parts. The architecture jointly learns the body parts and their associations with other body parts in a sequential process, so that a pose is formulated step by step. The complete pose is represented with a limited number of points, and following these points through the video allows a defined action to be inferred. The resulting features are classified with a Support Vector Machine (SVM). The proposed scheme was evaluated on four benchmark datasets, UT-interaction, UCF11, CASIA, and HCA, which contain criminal or suspicious actions such as kicking, punching, pushing, gun shooting, and sword fighting. It achieved an accuracy of 96.4% on UT-interaction, 99% on UCF11, 98% on CASIA, and 88.72% on HCA.
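To make the idea of a pose-based motion descriptor concrete, the following is a minimal sketch, not the descriptor defined in this paper. It assumes that a pose estimator has already produced 2D keypoints for one person in each frame, and it uses a hypothetical limb list (`LIMBS`) and hypothetical function names. For each limb it computes a position (the midpoint) and an orientation (the angle of the segment), then concatenates the frame-to-frame changes into one feature vector.

```python
import math

# Hypothetical skeleton: each limb joins two keypoint indices.
# This layout is illustrative only and is not taken from the paper.
LIMBS = [(0, 1), (1, 2), (2, 3)]


def limb_features(keypoints, limbs=LIMBS):
    """Return [mid_x, mid_y, angle] for every limb in a single frame."""
    feats = []
    for a, b in limbs:
        (xa, ya), (xb, yb) = keypoints[a], keypoints[b]
        feats.extend([
            (xa + xb) / 2.0,                 # limb position: midpoint x
            (ya + yb) / 2.0,                 # limb position: midpoint y
            math.atan2(yb - ya, xb - xa),    # limb orientation in radians
        ])
    return feats


def wrap_angle(theta):
    """Wrap an angle difference into the interval [-pi, pi)."""
    return (theta + math.pi) % (2 * math.pi) - math.pi


def motion_descriptor(frames, limbs=LIMBS):
    """Concatenate frame-to-frame changes in limb position and orientation."""
    per_frame = [limb_features(kp, limbs) for kp in frames]
    descriptor = []
    for prev, curr in zip(per_frame, per_frame[1:]):
        for i in range(0, len(curr), 3):
            descriptor.append(curr[i] - prev[i])
            descriptor.append(curr[i + 1] - prev[i + 1])
            descriptor.append(wrap_angle(curr[i + 2] - prev[i + 2]))
    return descriptor


# Two frames of four keypoints; only the last keypoint moves.
frames = [
    [(0, 0), (1, 0), (2, 0), (3, 0)],
    [(0, 0), (1, 0), (2, 0), (2, 1)],
]
descriptor = motion_descriptor(frames)
```

A fixed-length vector of this kind, computed over clips with the same number of frames, is the sort of input that an SVM classifier (for example, scikit-learn's `SVC`) could be trained on. The actual feature layout and classifier settings used in this work are not specified in the text above.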