Big data is an emerging paradigm applied to datasets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. With the spread of the smart city concept, surveillance systems now deploy a huge number of video surveillance devices such as surveillance cameras. Understanding the semantics of surveillance video has become an important component of many video-based applications. Manual annotation and tagging have been considered a reliable source of video semantics. Unfortunately, manual annotation is time-consuming and expensive when dealing with video data at this scale. Automated ontology-driven video annotation, in turn, is still hindered by the semantic gap between semantics and the visual appearance of video. Automatically understanding raw videos based solely on their visual appearance is therefore an important yet challenging problem: the video content must be described accurately enough to support organizing and searching videos.