2012
DOI: 10.1016/j.neunet.2011.09.005

Analysis and improvement of policy gradient estimation

Cited by 78 publications (112 citation statements)
References 11 publications
“…However, a classic policy gradient method called REINFORCE [24] tends to produce gradient estimates with large variance, which results in unreliable policy improvement [13]. More theoretically, it was shown that the variance of policy gradients can be proportional to the length of an agent's trajectory, due to the stochasticity of policies [25]. This can be a critical limitation in RL problems with long trajectories.…”
Section: Policy Iteration vs. Policy Search (mentioning)
confidence: 99%
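For background on the variance claim in this statement: the standard REINFORCE estimator (common textbook form, with notation assumed here rather than quoted from the citing papers) sums the score function over every time step of each sampled trajectory, which is where the dependence on trajectory length enters.

```latex
% Standard REINFORCE gradient estimator: N sampled trajectories h^{(n)} of
% length T, trajectory return R(h). Notation assumed for illustration.
\hat{\nabla}_\theta J(\theta)
  = \frac{1}{N} \sum_{n=1}^{N}
    \Bigg( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big) \Bigg)\,
    R\big(h^{(n)}\big)
% The inner sum contains T random score terms per trajectory, so the variance of
% the estimator can grow in proportion to T, as the statement above notes via [25].
```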
“…Then, instead of policy parameters, hyperparameters included in the prior distribution are learned from data. Thanks to this prior-based formulation, the variance of gradient estimates in PGPE is independent of the length of an agent's trajectory [25]. However, PGPE still suffers from an instability problem in small sample cases.…”
Section: Policy Iteration vs. Policy Search (mentioning)
confidence: 99%
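For contrast with the previous sketch, the parameter-based estimator referred to here can be written as follows (standard PGPE form with a parameter prior p(θ | ρ); again, notation assumed rather than quoted). Policy parameters are drawn once per trajectory and the policy itself is deterministic, so only one score term appears per trajectory.

```latex
% PGPE-style gradient with respect to the prior's hyperparameters \rho
% (standard form from the PGPE literature; notation assumed here).
\hat{\nabla}_\rho J(\rho)
  = \frac{1}{N} \sum_{n=1}^{N}
    \nabla_\rho \log p\big(\theta^{(n)} \mid \rho\big)\, R\big(h^{(n)}\big),
\qquad \theta^{(n)} \sim p(\theta \mid \rho)
% One score term per trajectory, independent of its length, which is why the
% variance of the estimate does not scale with the trajectory length.
```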
“…However, REINFORCE samples a random action from the stochastic policy at each time step. As a result, the gradient estimate has large variance even if the optimal baseline is subtracted (Zhao et al., 2012). To reduce the gradient’s variance, the Policy Gradients with Parameter-based Exploration (PGPE) (Sehnke et al., 2010) uses a deterministic policy and optimizes the parameters of a prior distribution of the deterministic policy parameters.…”
Section: Introduction (mentioning)
confidence: 99%
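To make the parameter-based exploration idea in this last statement concrete, here is a minimal sketch of a PGPE-style update with a Gaussian prior over policy parameters. Everything in it is assumed for illustration (the toy rollout_return objective, the hyperparameters mu and sigma, the learning rate), and the baseline shown is a plain batch-mean baseline rather than the optimal baseline analyzed by Zhao et al. (2012).

```python
import numpy as np

def rollout_return(theta):
    """Placeholder for one episode run with a deterministic policy parameterized
    by theta. This toy objective stands in for the environment and is an
    assumption for illustration, not part of the cited papers."""
    return -float(np.sum((theta - 1.0) ** 2))

def pgpe_step(mu, sigma, n_samples=20, lr=0.05):
    """One PGPE-style ascent step on the Gaussian prior's hyperparameters (mu, sigma)."""
    # Parameter-based exploration: sample policy parameters once per rollout.
    thetas = mu + sigma * np.random.randn(n_samples, mu.size)
    returns = np.array([rollout_return(th) for th in thetas])
    baseline = returns.mean()  # simple batch-mean baseline (not the optimal baseline)
    adv = returns - baseline
    # Log-derivative trick applied to the prior: one score term per rollout,
    # regardless of how long each episode is.
    grad_mu = np.mean((thetas - mu) / sigma**2 * adv[:, None], axis=0)
    grad_sigma = np.mean(((thetas - mu) ** 2 - sigma**2) / sigma**3 * adv[:, None], axis=0)
    return mu + lr * grad_mu, np.maximum(sigma + lr * grad_sigma, 1e-3)

mu, sigma = np.zeros(3), np.ones(3)
for _ in range(200):
    mu, sigma = pgpe_step(mu, sigma)
print(mu)  # drifts toward the toy optimum at 1.0
```

The structural point of the contrast is visible in pgpe_step: the log-derivative is taken with respect to the prior's hyperparameters, once per rollout, so the episode length never enters the gradient estimate.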