2022
DOI: 10.48550/arxiv.2201.07296
Preprint

Convergence of policy gradient for entropy regularized MDPs with neural network approximation in the mean-field regime

Abstract: We study the global convergence of policy gradient for infinite-horizon, continuous state and action space, entropy-regularized Markov decision processes (MDPs). We consider a softmax policy with (one-hidden-layer) neural network approximation in a mean-field regime. Additional entropic regularization in the associated mean-field probability measure is added, and the corresponding gradient flow is studied in the 2-Wasserstein metric. We show that the objective function is increasing along the gradient flow. Fur…
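As a rough illustration of the setup described in the abstract (a minimal sketch under assumed names and a toy reward, not the paper's algorithm): a softmax policy whose logits come from a one-hidden-layer network written in mean-field form (an average over N hidden "particles"), updated by gradient ascent on an entropy-regularized objective. The additional entropic regularization of the parameter measure and the 2-Wasserstein gradient-flow analysis are omitted, and the continuous action space is crudely discretized.

```python
# Illustrative sketch only; the toy reward, grid, and hyperparameters are assumptions.
import jax
import jax.numpy as jnp

N = 256            # number of hidden particles (mean-field width)
TAU = 0.1          # entropy-regularization temperature
ACTIONS = jnp.linspace(-1.0, 1.0, 41)   # crude grid standing in for a continuous action space

key = jax.random.PRNGKey(0)
kw, kb, kc = jax.random.split(key, 3)
params = {
    "w": jax.random.normal(kw, (N, 2)),   # input weights on (state, action)
    "b": jax.random.normal(kb, (N,)),
    "c": jax.random.normal(kc, (N,)),
}

def f(params, s, a):
    """One-hidden-layer network in mean-field form: an average over N neurons."""
    z = params["w"][:, 0] * s + params["w"][:, 1] * a + params["b"]
    return jnp.mean(params["c"] * jnp.tanh(z))

def log_policy(params, s):
    """Softmax policy over the action grid: log pi(a|s) proportional to f(s,a)/tau."""
    logits = jax.vmap(lambda a: f(params, s, a))(ACTIONS) / TAU
    return logits - jax.scipy.special.logsumexp(logits)

def reward(s, a):
    # Toy one-step reward, purely for illustration.
    return -(a - jnp.tanh(s)) ** 2

def objective(params, states):
    """Entropy-regularized objective E_pi[r] + tau * entropy, averaged over states."""
    def per_state(s):
        logp = log_policy(params, s)
        pi = jnp.exp(logp)
        r = jax.vmap(lambda a: reward(s, a))(ACTIONS)
        return jnp.sum(pi * (r - TAU * logp))
    return jnp.mean(jax.vmap(per_state)(states))

# One gradient-ascent step (the paper studies the corresponding gradient *flow*).
states = jax.random.normal(jax.random.PRNGKey(1), (64,))
grads = jax.grad(objective)(params, states)
params = jax.tree_util.tree_map(lambda p, g: p + 0.5 * g, params, grads)
print(float(objective(params, states)))
```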

Cited by 2 publications (4 citation statements)
References 8 publications
“…Conditions (ii)-(iv) help to ease the nonconvexity of φ → J(α^φ; ξ_0) and to reduce the oscillation of the loss function's curvature, which subsequently promotes the convergence of gradient-based algorithms (see [29]). Condition (ii), along with Example 2.3, also justifies recent reinforcement learning heuristics whereby adding f-divergences, such as the relative entropy, to the optimization objective can accelerate the convergence of PGMs (see e.g., [34,19]).…”
Section: Standing Assumptions and Main Results
confidence: 55%
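As an illustrative aside (a generic form, not a formula taken from either the preprint or the citing work): such f-divergence regularization typically adds a KL penalty toward a reference policy μ with weight τ, which with a uniform μ reduces, up to a constant, to entropy regularization.

```latex
% Illustrative only: a generic relative-entropy-regularized control objective.
% \mu is a reference policy and \tau > 0 the regularization weight.
J_{\tau}(\pi) \;=\; \mathbb{E}^{\pi}\!\left[\sum_{t \ge 0} \gamma^{t}
  \Bigl( r(s_t, a_t) \;-\; \tau\, D_{\mathrm{KL}}\bigl(\pi(\cdot \mid s_t)\,\big\|\,\mu(\cdot \mid s_t)\bigr) \Bigr)\right]
```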
“…As alluded to earlier, the map V_A ∋ φ → J(α^φ; ξ_0) ∈ ℝ ∪ {∞} is typically nonconvex and may not satisfy the Polyak-Łojasiewicz condition as in the setting with parametric policies ([10,38,26,11,13,19]). Hence, to ensure the linear convergence of the PPGM (1.7), we impose further conditions on the coefficients which guarantee that we are in one of the following five cases:…”
Section: Standing Assumptions and Main Results
confidence: 99%
“…These algorithms parametrise the policy as a function of the system state, and update the policy parametrisation based on the gradient of the control objective. Most of the progress, especially the convergence analysis of PG methods, has been in discrete-time Markov decision processes (MDPs) (see e.g., [5,10,18,34,16]). However, most real-world control systems, such as those in aerospace, the automotive industry and robotics, are naturally continuous-time dynamical systems, and hence do not fit in the MDP setting.…”
Section: Introduction
confidence: 99%