Object detection is a critical module in most vision-based surveillance applications and security systems. In crime scene analysis, images and videos provide essential visual documentation of a scene. By detecting objects associated with a specific crime, police officers can reconstruct the scene for subsequent analysis. Nevertheless, identifying objects of interest is highly arduous for law enforcement agencies, mainly because of the massive amount of data that must be processed. The main objective of this paper is therefore to propose a deep learning (DL)-based model that detects tracked objects such as handheld firearms and alerts the authorities to the threat before an incident occurs. We applied VGG19, ResNet50, and GoogLeNet as our deep learning models. The experimental results show that ResNet50 achieved the highest average accuracy, 0.92, compared with 0.91 for VGG19 and 0.89 for GoogLeNet. In addition, YOLOv6 achieved a higher mean average precision (mAP) and a faster inference speed than Faster R-CNN.