The application of drones carrying different devices for aerial hovering operations is becoming increasingly widespread, but currently there is very little research relying on reinforcement learning methods for hovering control, and it has not been implemented on physical machines. Drone’s behavior space regarding hover control is continuous and large-scale, making it difficult for basic algorithms and value-based reinforcement learning (RL) algorithms to have good results. In response to this issue, this article applies a watcher-actor-critic (WAC) algorithm to the drone’s hover control, which can quickly lock the exploration direction and achieve high robustness of the drone’s hover control while improving learning efficiency and reducing learning costs. This article first utilizes the actor-critic algorithm based on behavioral value Q (QAC) and the deep deterministic policy gradient algorithm (DDPG) for drone hover control learning. Subsequently, an actor-critic algorithm with an added watcher is proposed, in which the watcher uses a PID controller with parameters provided by a neural network as the dynamic monitor, transforming the learning process into supervised learning. Finally, this article uses a classic reinforcement learning environment library, Gym, and a current mainstream reinforcement learning framework, PARL, for simulation, and deploys the algorithm to a practical environment. A multi-sensor fusion strategy-based autonomous localization method for unmanned aerial vehicles is used for practical exercises. The simulation and experimental results show that the training episodes of WAC are reduced by 20% compared to the DDPG and 55% compared to the QAC, and the proposed algorithm has a higher learning efficiency, faster convergence speed, and smoother hovering effect compared to the QAC and DDPG.