“…These algorithms parametrise the policy as a function of the system state, and update the policy parametrisation based on the gradient of the control objective. Most of the progress, especially on the convergence analysis of PG methods, has been made in the setting of discrete-time Markov decision processes (MDPs) (see e.g., [5,10,18,34,16]). However, most real-world control systems, such as those in aerospace, the automotive industry and robotics, are naturally continuous-time dynamical systems, and hence do not fit directly into the discrete-time MDP setting.…”
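To make the first sentence concrete, the following is a minimal sketch of a policy gradient iteration in the discrete-time setting the excerpt refers to. Everything here is an illustrative assumption, not the paper's method: a toy 1-D linear system with quadratic cost, a Gaussian policy with a single scalar parameter `theta`, and a plain Monte Carlo (REINFORCE-style) score-function estimator of the gradient of the control objective.

```python
import random

# Illustrative toy setup (NOT from the paper): 1-D discrete-time dynamics
# x_{t+1} = x_t + u_t with stage cost x_t^2 + 0.1*u_t^2, and a Gaussian
# policy u_t ~ N(theta * x_t, SIGMA^2) parametrised by a scalar theta.
SIGMA = 0.5      # fixed exploration noise (assumed)
HORIZON = 10     # rollout length (assumed)

def rollout(theta, rng):
    """Simulate one trajectory; return (total reward, score function)."""
    x, ret, score = 1.0, 0.0, 0.0
    for _ in range(HORIZON):
        mean = theta * x
        u = rng.gauss(mean, SIGMA)
        # d/dtheta log N(u; theta*x, SIGMA^2) = (u - theta*x) * x / SIGMA^2
        score += (u - mean) * x / SIGMA**2
        ret -= x**2 + 0.1 * u**2     # reward = negative quadratic cost
        x += u
    return ret, score

def grad_estimate(theta, rng, n=2000):
    """Monte Carlo score-function estimate of the policy gradient."""
    return sum(r * s for r, s in (rollout(theta, rng) for _ in range(n))) / n

def avg_return(theta, rng, n=500):
    return sum(rollout(theta, rng)[0] for _ in range(n)) / n

rng = random.Random(0)
theta = 0.0
before = avg_return(theta, rng)
for _ in range(50):                  # gradient ascent on the control objective
    theta += 1e-3 * grad_estimate(theta, rng)
after = avg_return(theta, rng)
print(f"theta: {theta:.2f}, return: {before:.1f} -> {after:.1f}")
```

The update rule is exactly "parametrise the policy, then ascend the gradient of the objective"; the convergence analyses the excerpt cites study when and how fast such iterations reach an optimal parametrisation in MDPs, which is precisely what becomes delicate once the dynamics are continuous in time.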