Video is one of the major media that humans use to store information. As recording and storage devices become cheaper, an enormous number of videos is being generated. This unprecedented volume creates considerable new requirements for accessing videos. How to perform video filtering, i.e., obtaining a set of relevant video clips from a video repository, has therefore become a challenging research topic. In previous work, video filtering required the user to enter text to filter out irrelevant video clips, which for a long time made video filtering methods the same as document filtering methods. However, text-based video filtering has three limitations: (1) it disregards the rich content in the videos; (2) it is inapplicable when the text is absent, incomplete, or sparse; (3) it fails to support in-video filtering. These limitations leave text-based video filtering unable to meet the new requirements. In recent years, computers have become able to parse more meaningful content from videos. This non-textual content is complementary to text in many cases. Motivated by this, video filtering research has gradually shifted from text-based to non-text-based methods. Following this direction, we study how to improve video filtering systematically at three levels.

Frame-level. We propose to use detected visual objects to filter videos. In previous work, visual objects were obtained manually: humans were responsible for identifying the visual objects and connecting them in the videos. This process is costly when the data keep changing. We therefore leverage object detection to obtain the visual objects automatically for frame-level filtering. However, object detection by itself cannot identify and connect visual objects the way a human does.
To achieve that, we proposed a hybrid method to identify and connect the visual objects, which is divided into three steps: local merge, propagation, and global merge. We evaluated the proposed method on a real-world dataset and studied two issues: (1) whether the identifications and connections were accurate, and (2) how the environment influenced the proposed method. The experimental results were promising and showed that using detected visual objects for frame-level filtering is feasible.

Video-level. We identify a new small content set for surveillance video filtering. Surveillance video filtering, namely surveillance event detection (SED), is important for many safety and security applications. It aims to raise alarms for events in surveillance videos. Unlike classical video filtering, which extracts video content vectors from diverse sources, SED can leverage only motion content. The state-of-the-art content set for surveillance consists of STIP and MoSIFT. In our study, we proposed a new content set using dense trajectory (DT) and improved dense trajectory (IDT). According to our analysis, our new content set capture...