Visual indexing, or the ability to search and analyze visual media such as images and videos, is important for law enforcement agencies because it can speed up criminal investigations. As ever more visual media is created and shared online, the ability to search and analyze this data efficiently becomes increasingly important for investigators to do their job effectively. The major challenges for video captioning include accurately recognizing the objects and activities in the video frames, understanding their relationships and context, generating natural and descriptive language, and ensuring that the captions are relevant and useful. Near real-time processing is also required to enable agile forensic decision making, prompt triage and hand-over, and a reduction of the amount of data to be processed by investigators or by subsequent processing tools. This paper presents an efficient captioning-driven video analytics approach that extracts accurate descriptions of image and video files. The proposed approach includes a temporal segmentation technique that selects the most relevant frames. Subsequently, an image captioning model, specialized to describe visual media related to counter-terrorism and cybercrime, generates a description for each relevant frame. Our proposed method achieves high consistency and correlation with human summaries on the SumMe dataset, outperforming similar previous methods.
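The two-stage pipeline described above (temporal segmentation followed by per-frame captioning) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes OpenCV for frame access, uses a simple colour-histogram difference as a stand-in for the temporal segmentation technique, and the function caption_frame is a hypothetical placeholder for the specialized captioning model.

```python
# Minimal sketch of the segmentation + captioning pipeline (assumptions noted above).
import cv2
import numpy as np


def select_relevant_frames(video_path: str, threshold: float = 0.4) -> list[np.ndarray]:
    """Return frames whose colour histogram differs sharply from the last kept frame
    (a simple stand-in for the temporal segmentation step)."""
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_hist = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is None or cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > threshold:
            keyframes.append(frame)  # start of a new temporal segment: keep as a relevant frame
            prev_hist = hist
    cap.release()
    return keyframes


def caption_frame(frame: np.ndarray) -> str:
    """Hypothetical placeholder: in the proposed approach this would be the
    image captioning model specialized for counter-terrorism and cybercrime media."""
    return "caption for frame"


def describe_video(video_path: str) -> list[str]:
    """Caption each relevant frame to produce a textual description of the video."""
    return [caption_frame(f) for f in select_relevant_frames(video_path)]
```

Restricting captioning to the selected frames is what keeps the analytic close to real time: only a small fraction of the frames reaches the comparatively expensive captioning model.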