2013
DOI: 10.2478/pjbr-2013-0003
Robot Skill Learning: From Reinforcement Learning to Evolution Strategies

Abstract: Policy improvement methods seek to optimize the parameters of a policy with respect to a utility function. Owing to current trends involving searching in parameter space (rather than action space) and using reward-weighted averaging (rather than gradient estimation), reinforcement learning algorithms for policy improvement, e.g. PoWER and PI², are now able to learn sophisticated high-dimensional robot skills. A side-effect of these trends has been that, over the last 15 years, reinforcement learning (RL) algorithms…
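The pattern the abstract names, exploring in parameter space and replacing gradient estimation with reward-weighted averaging, can be summarized in a few lines. Below is a minimal sketch of such a PoWER/PI²-style update loop; the cost function, the weighting constant h, and all parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical cost function: lower is better. Stands in for the return of one
# rollout of a policy with parameter vector theta (illustrative, not from the paper).
def rollout_cost(theta):
    return float(np.sum(theta ** 2))

def reward_weighted_averaging(theta, n_samples=10, sigma=0.5, h=10.0, n_updates=50):
    """Generic policy improvement in parameter space: sample perturbed parameters,
    run one rollout per sample, and average the samples weighted by their
    (exponentiated, normalized) cost."""
    for _ in range(n_updates):
        # Explore in parameter space (not action space).
        samples = theta + sigma * np.random.randn(n_samples, theta.size)
        costs = np.array([rollout_cost(s) for s in samples])
        # Map costs to positive weights; lower cost -> higher weight.
        c = (costs - costs.min()) / (costs.max() - costs.min() + 1e-10)
        weights = np.exp(-h * c)
        weights /= weights.sum()
        # Reward-weighted averaging replaces gradient estimation.
        theta = weights @ samples
    return theta

theta_opt = reward_weighted_averaging(np.array([2.0, -1.5, 0.5]))
print(theta_opt)
```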

Cited by 101 publications (99 citation statements)
References 25 publications
“…the known analytic objectives to optimize the non-stationary policy. In our experiments we compare to CMA, which has been shown to be closely related to PI² [23].…”
Section: Related Work, A. Policy Search in Robotics
Mentioning, confidence: 99%
“…CMA has been used previously to learn robot skills [4,20,23]. We applied both methods on the same return function R(θ) over 100 iterations.…”
Section: B. Opening a Door with a PR2
Mentioning, confidence: 99%
“…The optimization algorithm we use is PI^BB, short for “Policy Improvement with Black-Box optimization” [2]. The PI^BB algorithm is explained and visualized in Fig.…”
Section: BB
Mentioning, confidence: 99%
“…If we ignore the costs at individual time steps r_t, and only use the return of an episode R = Σ_{t=1}^{T} r_t, policy improvement is equivalent to black-box optimization [2], where the black-box cost function J : Θ → ℝ takes θ as an input and returns the scalar return of the episode R, as in (1). Each evaluation of J thus corresponds to one episode, or rollout.…”
Section: Formalization
Mentioning, confidence: 99%
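To make that black-box reading concrete, here is a minimal sketch in which an episode's scalar return is wrapped as a cost function J(θ) and handed to a simple (1+1) evolution strategy. The quadratic J, the step-size rule, and all constants are illustrative assumptions; this is not the PI^BB algorithm from the cited work, only an example of treating policy improvement as black-box optimization.

```python
import numpy as np

# Hypothetical black-box cost J: Theta -> R. Each call stands in for one full
# episode (rollout) whose scalar return is all the optimizer sees; the
# per-time-step costs r_t are never exposed.
def J(theta):
    return float(np.sum((theta - 1.0) ** 2))

# A minimal (1+1) evolution strategy over J: once the problem is phrased this
# way, any black-box optimizer can perform policy improvement.
theta = np.zeros(3)
best = J(theta)
sigma = 0.5
for _ in range(200):
    candidate = theta + sigma * np.random.randn(theta.size)  # one new rollout
    cost = J(candidate)
    if cost < best:                     # keep the candidate only if it improves
        theta, best = candidate, cost
        sigma *= 1.1                    # simplified step-size adaptation
    else:
        sigma *= 0.97
print(theta, best)
```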