The 2012 International Joint Conference on Neural Networks (IJCNN)
DOI: 10.1109/ijcnn.2012.6252569

A comparison of learning speed and ability to cope without exploration between DHP and TD(0)

Abstract: This paper demonstrates the principal motivations for Dual Heuristic Dynamic Programming (DHP) learning methods for use in Adaptive Dynamic Programming and Reinforcement Learning in continuous state spaces: automatic local exploration, improved learning speed, and the ability to work without stochastic exploration in deterministic environments. In a simple experiment, the lear…
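
To make the comparison concrete, the sketch below contrasts the two critic updates being compared, assuming a linear TD(0) critic over features phi and a linear DHP critic with access to a model Jacobian df_ds and a cost gradient dU_ds; all names and constants (gamma, alpha, w, W) are illustrative assumptions rather than the paper's notation, and the DHP target omits the terms that differentiate through the action to keep the sketch short.

```python
import numpy as np

gamma, alpha = 0.99, 0.01   # discount factor and learning rate (assumed values)

def td0_update(w, phi, s, s_next, cost):
    """TD(0): learn a scalar value function V(s) = w . phi(s) from one sampled transition."""
    td_error = cost + gamma * (w @ phi(s_next)) - (w @ phi(s))
    return w + alpha * td_error * phi(s)            # nudge V(s) toward the bootstrapped target

def dhp_update(W, s, s_next, dU_ds, df_ds):
    """DHP: learn the value *gradient* lambda(s) = dV/ds with a linear critic lambda(s) = W @ s.
    Requires the model Jacobian df_ds = ds'/ds and the cost gradient dU_ds (assumed available);
    terms that differentiate through the action/policy are omitted for brevity."""
    lam, lam_next = W @ s, W @ s_next
    target = dU_ds + gamma * (df_ds.T @ lam_next)   # backpropagate the gradient through the model
    return W + alpha * np.outer(target - lam, s)    # a full gradient vector learned at each state
```

Because each DHP target carries a whole gradient vector rather than a single scalar error, every visited state supplies much more training signal, which is one intuition behind the learning-speed and exploration-free claims quoted in the citation statements below.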

Cited by 8 publications (8 citation statements)
References 18 publications
“…An explicit comparison in learning speed between these two different methods shows the speed up in learning can be of several orders of magnitude when VGL methods are used (Fairbank & Alonso, 2012c), and this confirms the automatic exploration and trajectory bending of VGL.…”
Section: Proof of Optimality
Confidence: 64%
“…But again, due to the large state space, value function approximation is a necessity, violating the assumptions for guaranteed convergence and thus leaving room for asymptotic performance gains as well. The authors in [25,26] have shown that DHP algorithms can eventually find an optimal solution without the explicit need for stochastic exploration, but the value learning algorithms (i.e. TD, TD(0)) could not.…”
Section: Related Work
Confidence: 99%
“…Ω_r is a ball of radius r. By taking the time derivative of the first term with respect to the state trajectories with u(t) (see (37)) and the second term with respect to the perturbed critic estimation error dynamics (23), using (25), substituting the update for the actor (31) and grouping terms together, then (40) becomes,…”
Section: Proof of Theorem
Confidence: 99%
“…Because HDP is an algorithm which requires stochastic exploration to optimise the ADP/RL problem effectively [32], in the HDP experiment we had to modify (3) to choose exploratory actions. Hence for the HDP experiment we used…”
Section: A Vertical-lander Problem
Confidence: 99%
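
The excerpt above notes that HDP needed its greedy action rule (the citing paper's equation (3), not reproduced here) modified to choose exploratory actions. As a hypothetical illustration of that kind of modification only, with greedy_action, sigma, and the action bounds all assumed rather than taken from the paper, one common choice is to perturb the greedy action with clipped Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1   # exploration noise scale (assumed, not from the paper)

def exploratory_action(greedy_action, action_low, action_high):
    """Perturb the greedy action with Gaussian noise, then clip back into the action bounds."""
    noisy = greedy_action + rng.normal(0.0, sigma, size=np.shape(greedy_action))
    return np.clip(noisy, action_low, action_high)
```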