In recent years, inspired by the remarkable success of deep learning, many deep neural network (DNN)-based methods have been proposed to solve this problem through cross-modal analysis. Owing to their enormous nonlinear learning capacity, these approaches show superior performance on cross-retrieval tasks compared to traditional cross-retrieval methods. Due to the growing amount of information to be retrieved and the limited storage capacity available, key-frame extraction for videos and key-feature extraction are becoming fundamental challenges. In this paper, we investigate the image-to-video retrieval problem and present a new convolutional neural network (CNN)-based visual search system that, given a query image, finds similar videos in an extensive database. Our framework comprises a key-frame extraction algorithm and a feature aggregation strategy. Specifically, in the key-frame extraction algorithm, by exploiting the objects detected in each frame, we present a new scheme to extract the features of each frame; redundant information in the video data is thereby removed, and the storage cost is reduced to an acceptable level. The feature aggregation strategy uses deep local feature fusion, which enables fast retrieval in a large-scale video database. Extensive experiments on publicly available datasets demonstrate that the proposed method achieves superior performance and accuracy compared with other state-of-the-art visual search methods.
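To make the two-stage pipeline outlined above concrete, the following is a minimal sketch of object-driven key-frame selection followed by feature aggregation. It is an illustration under stated assumptions, not the paper's actual algorithm: the function names (detect_objects, extract_cnn_feature, select_key_frames, aggregate_features), the Jaccard-overlap criterion for dropping redundant frames, and mean pooling as the fusion step are all hypothetical stand-ins.

```python
"""Hedged sketch: object-driven key-frame selection + feature aggregation.
All names and thresholds are illustrative assumptions, not the paper's method."""
import numpy as np


def detect_objects(frame: np.ndarray) -> frozenset:
    # Placeholder detector: a real pipeline would run a CNN object detector
    # and return the set of detected class labels for the frame.
    return frozenset(int(v) for v in np.unique(frame.astype(np.uint8) // 64))


def extract_cnn_feature(frame: np.ndarray, dim: int = 128) -> np.ndarray:
    # Placeholder for a CNN backbone: returns an L2-normalized descriptor.
    rng = np.random.default_rng(abs(hash(frame.tobytes())) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)


def select_key_frames(frames, jaccard_thresh: float = 0.6):
    """Keep a frame only when its detected-object set differs enough
    (Jaccard overlap below the threshold) from the last kept key frame,
    so near-duplicate frames are discarded and storage cost drops."""
    key_frames, last_objs = [], None
    for f in frames:
        objs = detect_objects(f)
        if last_objs is None:
            keep = True  # always keep the first frame
        else:
            union = objs | last_objs
            overlap = len(objs & last_objs) / len(union) if union else 1.0
            keep = overlap < jaccard_thresh
        if keep:
            key_frames.append(f)
            last_objs = objs
    return key_frames


def aggregate_features(key_frames) -> np.ndarray:
    """Fuse per-key-frame descriptors into one video-level vector via
    mean pooling and re-normalization (a simple stand-in for the paper's
    deep local feature fusion); one vector per video enables fast search."""
    feats = np.stack([extract_cnn_feature(f) for f in key_frames])
    pooled = feats.mean(axis=0)
    return pooled / np.linalg.norm(pooled)


if __name__ == "__main__":
    video = [np.full((4, 4), i * 40.0) for i in range(8)]  # toy "frames"
    kfs = select_key_frames(video)
    print(f"kept {len(kfs)} of {len(video)} frames")
    query = extract_cnn_feature(video[3])
    score = float(query @ aggregate_features(kfs))  # cosine similarity
    print(f"image-to-video similarity: {score:.3f}")
```

With one aggregated descriptor per video, image-to-video retrieval reduces to a nearest-neighbor search over video-level vectors, which is what makes large-scale databases tractable.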