As a natural, intuitive, and easy-to-learn mode of interaction, gesture plays an important role in communication. Hand detection involves multimodal information, spans both static and dynamic detection, and poses intricate spatial problems such as varying hand sizes, complex joint articulation, occlusion, and self-occlusion. This study presents a multimodal hand gesture recognition system based on YOLOv5 and MediaPipe with fused spatio-temporal features. First, the MediaPipe and OpenCV libraries were employed to implement hand keypoint detection. Then, human–computer interaction (HCI) for volume control was realized by measuring the distance between the thumb and index fingertips. Finally, a model was trained with the YOLOv5 algorithm to recognize different gesture categories, and its performance was evaluated and compared across the YOLOv5s, YOLOv5m, and YOLOv5l variants. The gesture recognition system interface was visualized with PyQt5. Experiments show that the average detection accuracy of the model is 99.4% and the recognition speed is around 0.2 s.
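The volume-control step described above maps the thumb-index fingertip distance to a volume level. A minimal sketch of that mapping is shown below; it assumes MediaPipe-style normalized (x, y) landmark coordinates in [0, 1], and the function names and calibration bounds (`d_min`, `d_max`) are illustrative assumptions, not taken from the paper:

```python
# Sketch: mapping a thumb-index pinch distance to a volume level.
# Landmark inputs are assumed to be MediaPipe-style normalized (x, y)
# coordinates in [0, 1]; d_min/d_max are hypothetical calibration bounds.
import math

def fingertip_distance(thumb_tip, index_tip):
    """Euclidean distance between two normalized (x, y) landmarks."""
    return math.hypot(thumb_tip[0] - index_tip[0], thumb_tip[1] - index_tip[1])

def distance_to_volume(dist, d_min=0.03, d_max=0.30):
    """Linearly map the pinch distance onto a 0-100 volume scale, clamped."""
    t = (dist - d_min) / (d_max - d_min)
    return round(100 * min(1.0, max(0.0, t)))

# Example: thumb tip at (0.40, 0.50), index tip at (0.40, 0.35)
d = fingertip_distance((0.40, 0.50), (0.40, 0.35))
print(distance_to_volume(d))
```

In a live pipeline, the two fingertip landmarks would come from MediaPipe's hand landmark output on each camera frame, and the resulting value would drive the system volume.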