Traditional robotic arms rely on complex programming and predefined trajectories, which limits their applicability. To improve the flexibility and adaptability of robotic arms, this study uses vision technology to improve grasping performance. Kinect is used to capture human arm movements, and a Kalman filter is introduced to smooth the image data and thereby optimize the motion recognition process. The residual network model is further improved by introducing the ELU activation function and a pre-activation mechanism to enhance the classification accuracy of gesture images. The results show that the improved ResNet50 model achieves 95% recognition accuracy after 25 training iterations, compared with 80% for the original model. Applying the Kalman filter makes the motion tracking curve smoother, demonstrating the corrective effect of the method. In simulation tests, the robotic arm identifies different elbow bending angles with 90–96% accuracy and mimics five specific hand gestures with 96–98% accuracy. These results support the practicality and effectiveness of applying vision capture technology and deep learning models to the intelligent control of robotic arms.
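
To illustrate the kind of smoothing a Kalman filter applies to a noisy tracking curve, the following is a minimal sketch and not the implementation used in this study. It assumes a scalar signal (for example, one elbow angle in degrees per frame) and a constant-value state model. The function name `kalman_smooth` and the noise parameters `process_var` and `measurement_var` are hypothetical choices made for illustration only.

```python
def kalman_smooth(measurements, process_var=1e-3, measurement_var=0.5):
    """Smooth a 1-D sequence with a scalar Kalman filter.

    Assumption (illustrative, not from the study): the true value changes
    slowly, so the state model is x_k = x_{k-1} + process noise.
    """
    if not measurements:
        return []

    estimate = float(measurements[0])  # initial state estimate
    error_var = 1.0                    # initial estimate uncertainty
    smoothed = [estimate]

    for z in measurements[1:]:
        # Predict: the state is assumed unchanged, uncertainty grows.
        error_var += process_var

        # Update: blend prediction and measurement using the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= (1.0 - gain)

        smoothed.append(estimate)

    return smoothed


def total_variation(values):
    """Sum of absolute frame-to-frame changes, a rough measure of jitter."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))


if __name__ == "__main__":
    # Hypothetical noisy elbow-angle readings around 90 degrees.
    raw = [90, 94, 86, 95, 85, 93, 87, 91, 89]
    filtered = kalman_smooth(raw)
    print("raw jitter:     ", total_variation(raw))
    print("filtered jitter:", round(total_variation(filtered), 2))
```

In this sketch, a larger `measurement_var` relative to `process_var` makes the filter trust its own prediction more, which yields a smoother but slower-responding curve. A real motion-capture pipeline would more likely track position and velocity jointly for each joint.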