“…These algorithms parametrise the policy as a function of the system state, and update the policy parametrisation based on the gradient of the control objective. Most of the progress, especially on the convergence analysis of PG methods, has been made in the setting of discrete-time Markov decision processes (MDPs) (see e.g., [5,10,18,34,16]). However, most real-world control systems, such as those in aerospace, the automotive industry and robotics, are naturally continuous-time dynamical systems, and hence do not fit directly into the discrete-time MDP setting.…”
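To make the first sentence concrete, the following is a minimal sketch of a policy gradient iteration in the discrete-time setting the excerpt refers to. Everything here is an illustrative assumption, not the paper's method: a toy 1-D linear system with quadratic cost, a Gaussian policy with a single scalar parameter `theta`, and a plain Monte Carlo (REINFORCE-style) score-function estimator of the gradient of the control objective.

```python
import random

# Illustrative toy setup (NOT from the paper): 1-D discrete-time dynamics
# x_{t+1} = x_t + u_t with stage cost x_t^2 + 0.1*u_t^2, and a Gaussian
# policy u_t ~ N(theta * x_t, SIGMA^2) parametrised by a scalar theta.
SIGMA = 0.5      # fixed exploration noise (assumed)
HORIZON = 10     # rollout length (assumed)

def rollout(theta, rng):
    """Simulate one trajectory; return (total reward, score function)."""
    x, ret, score = 1.0, 0.0, 0.0
    for _ in range(HORIZON):
        mean = theta * x
        u = rng.gauss(mean, SIGMA)
        # d/dtheta log N(u; theta*x, SIGMA^2) = (u - theta*x) * x / SIGMA^2
        score += (u - mean) * x / SIGMA**2
        ret -= x**2 + 0.1 * u**2     # reward = negative quadratic cost
        x += u
    return ret, score

def grad_estimate(theta, rng, n=2000):
    """Monte Carlo score-function estimate of the policy gradient."""
    return sum(r * s for r, s in (rollout(theta, rng) for _ in range(n))) / n

def avg_return(theta, rng, n=500):
    return sum(rollout(theta, rng)[0] for _ in range(n)) / n

rng = random.Random(0)
theta = 0.0
before = avg_return(theta, rng)
for _ in range(50):                  # gradient ascent on the control objective
    theta += 1e-3 * grad_estimate(theta, rng)
after = avg_return(theta, rng)
print(f"theta: {theta:.2f}, return: {before:.1f} -> {after:.1f}")
```

The update rule is exactly "parametrise the policy, then ascend the gradient of the objective"; the convergence analyses the excerpt cites study when and how fast such iterations reach an optimal parametrisation in MDPs, which is precisely what becomes delicate once the dynamics are continuous in time.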