Analysis and improvement of policy gradient estimation

Zhao, Tingting; Hachiya, Hirotaka; Niu, Gang; Sugiyama, Masashi

doi:10.1016/j.neunet.2011.09.005

Cited by 78 publications

(112 citation statements)

References 11 publications

Supporting

Mentioning: 108

Contrasting


“…However, a classic policy gradient method called REINFORCE [24] tends to produce gradient estimates with large variance, which results in unreliable policy improvement [13]. More theoretically, it was shown that the variance of policy gradients can be proportional to the length of an agent's trajectory, due to the stochasticity of policies [25]. This can be a critical limitation in RL problems with long trajectories.…”

Section: Policy Iteration Vs Policy Search (mentioning)

confidence: 99%

“…Then, instead of policy parameters, hyperparameters included in the prior distribution are learned from data. Thanks to this prior-based formulation, the variance of gradient estimates in PGPE is independent of the length of an agent's trajectory [25]. However, PGPE still suffers from an instability problem in small sample cases.…”

Section: Policy Iteration Vs Policy Search (mentioning)

confidence: 99%

“…For this reason, policy update by REINFORCE tends to be unreliable [13]. In particular, the variance of gradient estimates in REINFORCE can be proportional to the length of the history, T, due to the stochasticity of policies [25]. This can be a critical limitation when the history is long.…”

Section: Reinforce (mentioning)

confidence: 99%

“…Based on these paired samples, an empirical estimator of the above gradient (with baseline subtraction) is given as follows [25]:…”

Section: Policy Gradients With Parameter-based Exploration (PGPE) (mentioning)

confidence: 99%
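The statements above contrast how the variance of REINFORCE and PGPE gradient estimates depends on the trajectory length T. The following is a minimal numerical sketch of that contrast, not the estimator from [25]: it assumes a hypothetical one-dimensional toy problem with no state, a Gaussian policy (REINFORCE) or Gaussian prior (PGPE) with unit standard deviation, a return equal to the average negative squared distance to a made-up target action, and plain likelihood-ratio estimators without baseline subtraction. All function names and parameters are illustrative.

```python
import numpy as np

def reinforce_grad_samples(T, n, rng, theta=0.0, sigma=1.0, target=1.0):
    """Per-trajectory REINFORCE gradient estimates w.r.t. the policy mean theta."""
    # A fresh action is drawn from N(theta, sigma^2) at each of the T steps.
    actions = theta + sigma * rng.standard_normal((n, T))
    # Toy return (assumption): average negative squared distance to the target.
    returns = -np.mean((actions - target) ** 2, axis=1)
    # Score: sum over time of d/dtheta log N(a_t | theta, sigma^2).
    score = np.sum(actions - theta, axis=1) / sigma**2
    return returns * score

def pgpe_grad_samples(T, n, rng, eta=0.0, tau=1.0, target=1.0):
    """Per-trajectory PGPE gradient estimates w.r.t. the prior mean eta."""
    # One policy parameter is drawn per trajectory; the policy is deterministic.
    params = eta + tau * rng.standard_normal(n)
    actions = np.repeat(params[:, None], T, axis=1)
    returns = -np.mean((actions - target) ** 2, axis=1)
    # Score of the prior: d/deta log N(param | eta, tau^2).
    score = (params - eta) / tau**2
    return returns * score

def variance_by_horizon(horizons=(5, 50), n=20000, seed=0):
    """Empirical variance of both estimators for each trajectory length."""
    rng = np.random.default_rng(seed)
    return {
        T: (np.var(reinforce_grad_samples(T, n, rng)),
            np.var(pgpe_grad_samples(T, n, rng)))
        for T in horizons
    }

if __name__ == "__main__":
    for T, (v_rf, v_pgpe) in variance_by_horizon().items():
        print(f"T={T:3d}  REINFORCE var={v_rf:8.2f}  PGPE var={v_pgpe:8.2f}")
```

In this toy setting the REINFORCE variance should grow roughly linearly with T, because the score sums T independent noise terms, while the PGPE variance should stay roughly constant, because randomness enters only once per trajectory. Real tasks with stochastic transitions will behave less cleanly.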


Model-based policy gradients with parameter-based exploration by least-squares conditional density estimation

et al. 2014

Self Cite


The goal of reinforcement learning (RL) is to let an agent learn an optimal control policy in an unknown environment so that future expected rewards are maximized. The model-free RL approach directly learns the policy based on data samples. Although using many samples tends to improve the accuracy of policy learning, collecting a large number of samples is often expensive in practice. On the other hand, the model-based RL approach first estimates the transition model of the environment and then learns the policy based on the estimated transition model. Thus, if the transition model is accurately learned from a small amount of data, the model-based approach can perform better than the model-free approach. In this paper, we propose a novel model-based RL method by combining a recently proposed model-free policy search method called policy gradients with parameter-based exploration and the state-of-the-art transition model estimator called least-squares conditional density estimation. Through experiments, we demonstrate the practical usefulness of the proposed method.



“…However, REINFORCE samples a random action from the stochastic policy at each time step. As a result, the gradient estimate has large variance even if the optimal baseline is subtracted (Zhao et al, 2012). To reduce the gradient’s variance, the Policy Gradients with Parameter-based Exploration (PGPE) (Sehnke et al, 2010) uses a deterministic policy and optimizes the parameters of a prior distribution of the deterministic policy parameters.…”

Section: Introduction (mentioning)

confidence: 99%

Adaptive Baseline Enhances EM-Based Policy Search: Validation in a View-Based Positioning Task of a Smartphone Balancer

2017


EM-based policy search methods estimate a lower bound of the expected return from the histories of episodes and iteratively update the policy parameters using the maximum of a lower bound of expected return, which makes gradient calculation and learning rate tuning unnecessary. Previous algorithms like Policy learning by Weighting Exploration with the Returns, Fitness Expectation Maximization, and EM-based Policy Hyperparameter Exploration implemented the mechanisms to discard useless low-return episodes either implicitly or using a fixed baseline determined by the experimenter. In this paper, we propose an adaptive baseline method to discard worse samples from the reward history and examine different baselines, including the mean, and multiples of SDs from the mean. The simulation results of benchmark tasks of pendulum swing up and cart-pole balancing, and standing up and balancing of a two-wheeled smartphone robot showed improved performances. We further implemented the adaptive baseline with mean in our two-wheeled smartphone robot hardware to test its performance in the standing up and balancing task, and a view-based approaching task. Our results showed that with adaptive baseline, the method outperformed the previous algorithms and achieved faster, and more precise behaviors at a higher successful rate.


Policy search for active fault diagnosis with partially observable state

Král

Punčochář

2022

Adaptive Control & Signal


Summary: The article deals with a novel design of an active fault detector (AFD) for a nonlinear stochastic system with a partially observable state. The imperfect state information problem is converted to a perfect state information problem using a state estimator. Subsequently, the problem is decomposed into separate tasks of an optimal fault detector design and an approximate input generator design using a dynamic programming technique. While the former task is straightforward, the latter represents a nonlinear functional optimization problem. The input generator is approximated by a multi‐layer perceptron neural network, and its unknown parameters are found using the policy search method. Effectiveness of the proposed AFD design is demonstrated numerically on a pendulum system and a heating/cooling system.

