Multi-scale spatio-temporal convolution modules and channel attention mechanisms with global information synchronization perform well in sports gesture and action recognition; this paper therefore combines the two to construct a dynamic gesture recognition model. The trained model is deployed to mobile phones and web pages using the Java programming language, and the application scenarios of the basketball referee gesture recognition system are described in detail. After the dataset is determined, the processing flow of the real-time basketball referee gesture recognition system is analyzed through simulation. The results show that the system's video cropped-frame analysis achieves accuracies of 87.64% and 71.25%, with loss values of 6.24 and 5.59. Compared with single-resolution features, the fused features improve the system's recognition performance, and the average recognition rate is 92.20%. The approach is applicable to referee gesture tracking in complex basketball scenarios and lays a foundation for research at higher semantic levels of sports video analytics, such as action recognition, event detection, and content understanding.