2021
DOI: 10.48550/arxiv.2110.11280
Preprint

Actor-critic is implicitly biased towards high entropy optimal policies

Abstract: We show that the simplest actor-critic method (a linear softmax policy updated with TD through interaction with a linear MDP, but featuring no explicit regularization or exploration) does not merely find an optimal policy, but moreover prefers high entropy optimal policies. To demonstrate the strength of this bias, the algorithm not only has no regularization, no projections, and no exploration such as ε-greedy, but is moreover trained on a single trajectory with no resets. The key consequence of the high entropy bias…
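To make the setting in the abstract concrete, here is a minimal sketch of a softmax actor with a TD(0) critic run on a single continuing trajectory, with no regularization, projections, or exploration bonus. It is an illustrative reading of the description only, not the authors' algorithm: the tiny random MDP, the step sizes, and the use of the TD error as the advantage signal are assumptions, and the tabular case stands in for the paper's linear MDP with linear features.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# A tiny random MDP standing in for the paper's linear MDP (hypothetical).
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition kernel
R = rng.uniform(size=(n_states, n_actions))                       # rewards

theta = np.zeros((n_states, n_actions))  # actor: softmax logits
v = np.zeros(n_states)                   # critic: state values
alpha_actor, alpha_critic = 0.01, 0.05

s = 0
for _ in range(200_000):                 # one continuing trajectory, no resets
    pi_s = softmax(theta[s])
    a = rng.choice(n_actions, p=pi_s)    # sample the policy itself: no ε-greedy
    r = R[s, a]
    s_next = rng.choice(n_states, p=P[s, a])

    # TD(0) critic update
    td_error = r + gamma * v[s_next] - v[s]
    v[s] += alpha_critic * td_error

    # Softmax policy-gradient actor update, with the TD error as the advantage
    # signal; note the absence of any entropy bonus or projection step.
    grad_log_pi = -pi_s
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * td_error * grad_log_pi

    s = s_next
```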

Cited by 2 publications (5 citation statements)
References 28 publications
“…Note that the previous convergence recovers (7) when taking ρ = ν_π*. The same procedure can also be directly applied to adapt the analysis of the SHPMD method.…”
Section: Dropping Assumption (mentioning)
confidence: 53%
“…Theorem 3.2 is also related to [7], which shows that the actor-critic method produces policy iterates with bounded Kullback-Leibler (KL) divergence to the optimal policy with maximal entropy, π*_U. Our differences mainly exist in the following aspects: (1) We study general discounted MDPs with finite state and action spaces.…”
Section: Local Superlinear Convergence and Implicit Regularization (mentioning)
confidence: 99%
“…We present the proof of Lemma 5.1 in Appendix B.1. Lemma 5.1 describes a relationship between any two policies and a policy belonging to the Bregman projected policy class associated with F_Θ and h. While similar results have been obtained and exploited for the tabular setting (Xiao, 2022) and for the negative entropy mirror map (Liu et al., 2019; Hu et al., 2021), Lemma 5.1 is the first to allow any parametrization class F_Θ and any choice of mirror map. Since Lemma 5.1 does not depend on Algorithm 1, we expect it to be helpful in contexts outside this work.…”
Section: Theoretical Analysis (mentioning)
confidence: 63%
“…for all s ∈ S. In this example, AMPO recovers tabular NPG (Shani et al., 2020) and NPG with log-linear policies (Hu et al., 2021) when f_θ(s, a) = θ_{s,a} and when f_θ and Q_t are linear functions for all t ≥ 0, respectively. We refer to Appendix A.1 for details and an extension to Tsallis entropy.…”
Section: Example 4.4 (NPG), If h Is the Negative Entropy (mentioning)
confidence: 98%
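As a point of reference for the NPG special case mentioned in the last statement, the sketch below shows one tabular NPG step under the negative-entropy mirror map. It assumes access to exact action values Q_t(s, a) and a step size η, and is illustrative only; it is not the AMPO algorithm of the citing paper.

```python
import numpy as np

def npg_negative_entropy_update(pi, q, eta):
    """One tabular NPG step with the negative-entropy mirror map.

    pi:  (n_states, n_actions) current policy, rows sum to 1
    q:   (n_states, n_actions) action-value estimates Q_t(s, a)
    eta: step size

    The update is multiplicative: pi_{t+1}(a|s) ∝ pi_t(a|s) * exp(eta * Q_t(s, a)),
    i.e. a softmax policy whose logits are shifted by eta * Q_t.
    """
    logits = np.log(pi) + eta * q
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```

Under the log-linear parametrization f_θ(s, a) = θ_{s,a}, incrementing the logits θ by η·Q_t reproduces exactly this multiplicative update, which is the sense in which the tabular case is recovered in the quoted example.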