In this paper, we analyze the construction of a cross-media collaborative filtering neural network and use it to design a deep model for short-video click-through-rate prediction. By directly extracting the image features, behavioral features, and audio features of short videos as the video representation, the model takes more video information into account than comparable models, and the experimental results show that incorporating these multimodal features improves the AUC metric over variants without them. We further exploit the strength of recurrent neural networks in processing sequential information and incorporate them into the deep-wide model, compensating for the original model's limited ability to learn dependencies within user behavior sequences. On this basis, we propose an attention-based deep-wide model that models users' historical behaviors and uses the attention mechanism to measure the influence of each historical behavior on the current behavior. Data augmentation is applied when user behavior sequences are too short, and the historical behavior sequence is introduced at both the input layer and the top connection. Compared with models commonly used in recent years, the experimental results show that the proposed model improves on AUC, accuracy, and log loss.
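The paper does not give implementation details for the attention step, but the idea of weighting each historical behavior by its relevance to the current behavior can be sketched as dot-product attention pooling. All names, shapes, and the scoring function below are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(history, query):
    """Weight each historical behavior embedding by its relevance
    to the current (query) behavior, then sum.

    history: (T, d) array of past-behavior embeddings
    query:   (d,)   embedding of the current behavior
    returns: (d,)   attention-weighted user-interest vector
    """
    scores = history @ query   # dot-product relevance, shape (T,)
    weights = softmax(scores)  # attention weights, sum to 1
    return weights @ history   # weighted sum of history embeddings

# Toy usage with random embeddings.
rng = np.random.default_rng(0)
hist = rng.normal(size=(5, 8))   # 5 past behaviors, 8-dim embeddings
q = rng.normal(size=8)           # current behavior
user_interest = attention_pool(hist, q)
```

In the full model this pooled vector would be fed, together with the wide features, into the top layers of the deep-wide network; in practice the scores would come from learned projections rather than raw dot products.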
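The augmentation used for short behavior sequences is likewise not specified; one common scheme, sketched here under assumed names (prefix expansion with left-padding, `pad_token` hypothetical), turns each prefix of a sequence into a separate training sample so that even short histories yield multiple (history, target) pairs:

```python
def augment_short_sequence(seq, min_len, pad_token=0):
    """Expand one user behavior sequence into several training samples.

    Each prefix of length >= 2 becomes a (history, target) pair,
    and histories shorter than min_len are left-padded so every
    sample has a fixed-length history.
    """
    samples = []
    for end in range(2, len(seq) + 1):
        history, target = seq[:end - 1], seq[end - 1]
        if len(history) < min_len:
            history = [pad_token] * (min_len - len(history)) + history
        samples.append((history, target))
    return samples

pairs = augment_short_sequence([5, 9, 4], min_len=4)
# → [([0, 0, 0, 5], 9), ([0, 0, 5, 9], 4)]
```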