Video behaviour recognition and semantic understanding are important components of intelligent video analytics. Traditional human behaviour recognition suffers from low efficiency and poor accuracy. For example, most existing methods take as input video frames obtained by even segmentation and fixed sampling, which may lose important information between sampling intervals, fail to identify the key frames of each video segment, and fail to use contextual semantics to understand the current behaviour. To improve the efficiency and semantic understanding of video segments, this paper adopts a three-layer semantic recognition approach based on key frame extraction. First, the bottom layer segments the video, extracts the key frames of each segment, and obtains a preliminary understanding of the basic semantics: the persons' identities, their behaviours, and the environment. Second, this preliminary information is passed to the middle layer for semantic integration, where a loss function is used to learn the latent relationships between the semantics of different modalities, strengthening the integration capacity and the robustness of character semantic integration. Finally, the whole network is fine-tuned, adjusting all of its parameters, to complete the semantic recognition task for the video scene. The method achieves higher recognition accuracy on certain datasets and effectively recognises the semantics of characters and behaviours in videos. Practical testing shows that combining key frame extraction with video scene semantic recognition improves the accuracy and effectiveness of character semantic recognition in video.
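To make the contrast between fixed sampling and key frame extraction concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not state which key frame criterion is used, so the sketch assumes a simple one: within each evenly sized segment, pick the frame that differs most from its predecessor. The names `Frame`, `split_segments`, `uniform_sample`, `frame_diff` and `key_frames` are hypothetical, and frames are represented as flattened lists of grey levels purely for illustration.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # assumption: a flattened greyscale frame


def split_segments(n_frames: int, n_segments: int) -> List[Tuple[int, int]]:
    """Evenly split [0, n_frames) into half-open (start, end) ranges."""
    if not 1 <= n_segments <= n_frames:
        raise ValueError("need 1 <= n_segments <= n_frames")
    bounds = [i * n_frames // n_segments for i in range(n_segments + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(n_segments)]


def uniform_sample(n_frames: int, n_segments: int) -> List[int]:
    """Baseline: fixed sampling, taking the first frame of each segment."""
    return [start for start, _ in split_segments(n_frames, n_segments)]


def frame_diff(a: Frame, b: Frame) -> float:
    """Mean absolute pixel difference between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def key_frames(frames: Sequence[Frame], n_segments: int) -> List[int]:
    """Assumed criterion: per segment, the frame that changes most
    relative to the frame immediately before it."""
    picks = []
    for start, end in split_segments(len(frames), n_segments):
        best, best_score = start, -1.0
        for i in range(start, end):
            score = frame_diff(frames[i], frames[i - 1]) if i > 0 else 0.0
            if score > best_score:  # ties keep the earliest frame
                best, best_score = i, score
        picks.append(best)
    return picks


# Synthetic 12-frame clip: a static scene with a brief event at frame 5.
clip = [[0.0] * 4 for _ in range(12)]
clip[5] = [1.0] * 4

print(uniform_sample(len(clip), 3))  # [0, 4, 8]  (event at frame 5 is missed)
print(key_frames(clip, 3))           # [0, 5, 8]  (event frame is selected)
```

In this toy clip, fixed sampling skips the only frame where something happens, which is the information loss the abstract attributes to even segmentation with fixed sampling. A real system would likely replace `frame_diff` with a learned or feature-based score; that choice is outside what the abstract specifies.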