The explosive growth and rapid version iteration of mobile applications have created an enormous mobile application testing workload. Robotic testing can handle repetitive testing tasks efficiently, compensating for the inconsistency of manual testing while improving testing throughput. Vision-based robotic testing identifies the types of test actions by analyzing videos of expert testers and generates expert-imitation test cases. This expert-imitation testing method applies machine learning to expert test videos, produces test cases with high reliability and reusability, and drives a robot to execute them. However, estimating multi-dimensional gestures from 2D images is difficult, requiring a complex pipeline of tracking, detection, and recognition of dynamic gestures. This article therefore focuses on the analysis and recognition of test actions in mobile application robot testing. Combining an improved YOLOv5 algorithm with the ResNet-152 algorithm, we propose a machine-vision-based method for visually modeling mobile application test actions. Precise hand localization is achieved by injecting dynamic anchors, an attention mechanism, and weighted boxes fusion into the YOLOv5 algorithm, raising recognition accuracy from 82.6% to 94.8%. Introducing a pyramid context awareness mechanism into the ResNet-152 algorithm improves test action classification accuracy from 72.57% to 76.84%. Experiments show that this method reduces duplicate and missed detections of test actions and improves the accuracy of test action recognition.
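The weighted boxes fusion step merges overlapping candidate detections by confidence-weighted averaging of their coordinates, rather than discarding all but one as non-maximum suppression does, which is what reduces duplicate detections of the same hand. The abstract does not give the authors' fusion parameters, so the following is a minimal sketch assuming the open-source `ensemble_boxes` package; the box coordinates, scores, and thresholds are illustrative only.

```python
# Sketch: fusing overlapping hand detections with weighted boxes fusion (WBF).
# Assumes the open-source `ensemble_boxes` package (pip install ensemble-boxes);
# the paper's exact fusion parameters are not given, so the thresholds below
# are illustrative defaults, not the authors' settings.
from ensemble_boxes import weighted_boxes_fusion

# Candidate boxes from the detector, normalized to [0, 1] as
# [x_min, y_min, x_max, y_max]; values here are hypothetical.
boxes_list = [[
    [0.42, 0.30, 0.58, 0.52],   # two overlapping candidates for the same hand
    [0.44, 0.31, 0.60, 0.55],
]]
scores_list = [[0.91, 0.78]]
labels_list = [[0, 0]]          # single class: hand

boxes, scores, labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list,
    iou_thr=0.55,       # candidates with IoU above this are fused into one box
    skip_box_thr=0.10,  # drop very low-confidence candidates before fusion
)
print(boxes, scores, labels)  # one fused box with confidence-weighted coordinates
```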
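The pyramid context awareness mechanism is not detailed in the abstract; one common realization of pyramid context aggregation is PSPNet-style pyramid pooling, which pools backbone features at several spatial scales and fuses them before classification. The sketch below grafts such a module onto a torchvision ResNet-152 trunk under that assumption; the pooling scales, channel sizes, and the action class count are illustrative, not the authors' configuration.

```python
# Sketch: PSPNet-style pyramid context module on a ResNet-152 backbone.
# The paper's exact "pyramid context awareness mechanism" is not specified
# in the abstract; this is one standard interpretation, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet152

class PyramidContext(nn.Module):
    """Pool features at several scales, then concatenate with the input."""
    def __init__(self, in_ch=2048, scales=(1, 2, 3, 6)):
        super().__init__()
        branch_ch = in_ch // len(scales)
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(s),            # global context at s x s
                nn.Conv2d(in_ch, branch_ch, 1, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            ) for s in scales
        )

    def forward(self, x):
        h, w = x.shape[2:]
        ctx = [F.interpolate(b(x), size=(h, w), mode="bilinear",
                             align_corners=False) for b in self.branches]
        return torch.cat([x, *ctx], dim=1)          # input + multi-scale context

class ActionClassifier(nn.Module):
    """ResNet-152 trunk + pyramid context + linear head for test actions."""
    def __init__(self, num_actions=6):              # class count is hypothetical
        super().__init__()
        trunk = resnet152(weights=None)
        self.features = nn.Sequential(*list(trunk.children())[:-2])  # drop pool/fc
        self.context = PyramidContext()
        self.head = nn.Linear(2048 * 2, num_actions)  # 2048 + 4 * (2048 // 4)

    def forward(self, x):
        f = self.context(self.features(x))
        f = F.adaptive_avg_pool2d(f, 1).flatten(1)
        return self.head(f)

logits = ActionClassifier()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 6])
```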