Traditional target detection in video images is mostly used to distinguish broad target categories. When image content is complex and diverse, it cannot capture enough visual cues, which makes it difficult to distinguish small differences between categories. This article therefore studies a salient target detection method for video images based on a convolutional neural network. The dynamic features of a video are first extracted through dictionary learning, and a coefficient matrix is then generated from the learned dictionary, so that the underlying dynamics of the video are fully described. The dynamic mode decomposition (DMD) algorithm is used to extract the dynamic modes of the video, from which the foreground and background of the video frames are separated. A salient target detection model for video images is then constructed on the basis of the YOLOv4 network model. To address the shortcomings of YOLOv4, such as redundant parameters, a large number of convolution modules, and a complex architecture, a series of model optimizations is carried out. Experimental results verify the effectiveness of the model.
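
As an illustration of the foreground/background separation step, the following is a minimal sketch of standard exact DMD in NumPy. It is not the implementation used in this article: the function name `dmd_separate`, the threshold `tol`, the rank-selection rule, and the convention that each video frame is flattened into one column of `frames` are all assumptions made for the example. Modes whose continuous-time frequency is close to zero are treated as the near-static background, and the residual is taken as the foreground.

```python
import numpy as np

def dmd_separate(frames, rank=None, tol=1e-2):
    """Split a video into background and foreground with exact DMD.

    frames: real array of shape (n_pixels, n_frames), one flattened
            frame per column (assumed layout, time step = 1 frame).
    Returns (background, foreground), both shaped like `frames`.
    """
    X, Y = frames[:, :-1], frames[:, 1:]          # snapshot pairs x_t -> x_{t+1}

    # Reduced SVD of X, truncated to the numerically nonzero singular values.
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    if rank is None:
        rank = int(np.sum(s > 1e-10 * s[0]))
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    V = Vh.conj().T

    # Low-rank linear operator: A_tilde = U^H Y V S^{-1}.
    A_tilde = (U.conj().T @ Y @ V) / s
    eigvals, W = np.linalg.eig(A_tilde)

    # Exact DMD modes and their continuous-time frequencies.
    Phi = ((Y @ V) / s) @ W
    omega = np.log(eigvals.astype(complex))

    # Mode amplitudes fitted to the first frame.
    b = np.linalg.lstsq(Phi, frames[:, 0].astype(complex), rcond=None)[0]

    # Modes with |omega| near zero barely change over time: the background.
    is_bg = np.abs(omega) < tol
    t = np.arange(frames.shape[1])
    dynamics = b[is_bg, None] * np.exp(omega[is_bg, None] * t)
    background = (Phi[:, is_bg] @ dynamics).real

    foreground = frames - background
    return background, foreground
```

In this sketch the foreground is simply the residual after removing the slowly varying modes. A practical pipeline would typically threshold or post-process that residual before passing candidate regions to the detector.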