Underwater robots are often used in marine exploration and development to assist divers with underwater tasks. However, the underwater robots currently on the market have several problems: they offer only a single function (object detection or tracking), rely on traditional algorithms with low accuracy and robustness, and lack effective interaction with divers. To address these problems, we designed a gesture-recognition-based interaction method, with person tracking as an auxiliary means, for an underwater accompanying robot (UAR). We train and test the SSDLite detection algorithm on self-labeled underwater datasets, and combine the kernelized correlation filters (KCF) tracking algorithm with the “Active Control” target tracking rule to continuously track the underwater human body. Our experiments show that using underwater datasets and target tracking improves gesture recognition accuracy by 40–105%. In field experiments, the algorithm performed well: it achieved target tracking and gesture recognition at 29.4 FPS on a Jetson Xavier NX, and the UAR performed the corresponding actions according to the diver's gesture commands.
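To make the detector-plus-tracker combination concrete, the following is a minimal Python sketch of a generic detect-then-track loop. It is not the authors' implementation, and it does not reproduce the “Active Control” rule, which the abstract does not define. The sketch assumes a common pattern instead: run the (slow) detector to initialize a (fast) tracker, then fall back to the detector when the tracker loses the target or after a fixed number of tracked frames. All names (`detect_then_track`, `TrackStep`, `detect`, `make_tracker`, `redetect_every`) are hypothetical; in a real system `detect` might wrap an SSDLite model and `make_tracker` a KCF tracker.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); layout is an assumption

# A tracker "update" callable: takes a frame, returns (success flag, box).
TrackerUpdate = Callable[[object], Tuple[bool, Optional[Box]]]


@dataclass
class TrackStep:
    frame_index: int
    box: Optional[Box]
    source: str  # "detector", "tracker", or "lost"


def detect_then_track(
    frames: Iterable[object],
    detect: Callable[[object], Optional[Box]],
    make_tracker: Callable[[object, Box], TrackerUpdate],
    redetect_every: int = 30,  # hypothetical refresh interval, in frames
) -> Iterator[TrackStep]:
    """Yield one TrackStep per frame, preferring the tracker over the detector."""
    tracker: Optional[TrackerUpdate] = None
    tracked_since_detection = 0

    for i, frame in enumerate(frames):
        # Fast path: keep tracking while the tracker is alive and fresh.
        if tracker is not None and tracked_since_detection < redetect_every:
            ok, box = tracker(frame)
            if ok:
                tracked_since_detection += 1
                yield TrackStep(i, box, "tracker")
                continue

        # Slow path: no tracker yet, tracker failed, or a refresh is due.
        box = detect(frame)
        if box is None:
            tracker = None
            yield TrackStep(i, None, "lost")
        else:
            tracker = make_tracker(frame, box)
            tracked_since_detection = 0
            yield TrackStep(i, box, "detector")
```

Passing the detector and tracker in as callables keeps the loop independent of any particular model or vision library, so the control logic can be exercised with simple stubs before it is connected to real camera frames.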