Micro-expressions are genuine emotional expressions that are unconscious and difficult to hide. Identifying them has great potential in areas such as civil aviation security, criminal interrogation, and clinical medicine. However, their short duration, low intensity, and sparse action units make micro-expression spotting difficult. To address this problem, and inspired by object detection methods, we propose a VoVNet-based micro-expression spotting model driven by multi-scale features. First, VoVNet is used to extract and reuse features from receptive fields of different scales, improving the feature extraction capability. Second, the Feature Pyramid Network module extracts and fuses multi-scale features that incorporate optical flow, enabling interactive fusion of fine-grained feature information and semantic feature information. Finally, the model is trained and optimized on CAS(ME)² and SAMM Long Video. The experimental results show that, compared with the baseline method, the proposed model improves the F1 score by 0.1963 and 0.2441 on the two datasets, respectively, outperforming the most popular spotting methods.
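
As an illustration of how an F1 score for interval spotting can be computed, the following minimal sketch matches predicted intervals to ground-truth intervals by temporal intersection over union (IoU). The matching rule, the IoU threshold of 0.5, the inclusive frame indexing, and the names `interval_iou` and `spotting_f1` are assumptions made for this example; the text above does not specify the evaluation protocol, and the intervals in the usage lines are made up rather than taken from either dataset.

```python
def interval_iou(a, b):
    """Temporal IoU of two (onset, offset) frame intervals, both ends inclusive."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


def spotting_f1(predicted, ground_truth, iou_threshold=0.5):
    """F1 score where a prediction is a true positive if it overlaps an
    unmatched ground-truth interval with IoU >= iou_threshold (assumed rule)."""
    matched = set()  # indices of ground-truth intervals already claimed
    true_pos = 0
    for pred in predicted:
        for i, truth in enumerate(ground_truth):
            if i not in matched and interval_iou(pred, truth) >= iou_threshold:
                matched.add(i)
                true_pos += 1
                break
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(ground_truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical intervals, for illustration only.
ground_truth = [(10, 20), (50, 60)]
predicted = [(12, 22), (100, 110), (48, 58)]
f1 = spotting_f1(predicted, ground_truth)
print(round(f1, 4))
```

In this toy case two of the three predictions match a ground-truth interval and both ground-truth intervals are found, so precision is 2/3 and recall is 1. Greedy first-match assignment is used for brevity; an evaluation script could instead use an optimal one-to-one assignment.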