2013
DOI: 10.2478/pjbr-2013-0003
Robot Skill Learning: From Reinforcement Learning to Evolution Strategies

Abstract: Policy improvement methods seek to optimize the parameters of a policy with respect to a utility function. Owing to current trends involving searching in parameter space (rather than action space) and using reward-weighted averaging (rather than gradient estimation), reinforcement learning algorithms for policy improvement, e.g. PoWER and PI², are now able to learn sophisticated high-dimensional robot skills. A side-effect of these trends has been that, over the last 15 years, reinforcement learning (RL) algorithms…
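The pattern the abstract names, exploring in parameter space and replacing gradient estimation with reward-weighted averaging, can be summarized in a few lines. Below is a minimal sketch of such a PoWER/PI²-style update loop; the cost function, the weighting constant h, and all parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical cost function: lower is better. Stands in for the return of one
# rollout of a policy with parameter vector theta (illustrative, not from the paper).
def rollout_cost(theta):
    return float(np.sum(theta ** 2))

def reward_weighted_averaging(theta, n_samples=10, sigma=0.5, h=10.0, n_updates=50):
    """Generic policy improvement in parameter space: sample perturbed parameters,
    run one rollout per sample, and average the samples weighted by their
    (exponentiated, normalized) cost."""
    for _ in range(n_updates):
        # Explore in parameter space (not action space).
        samples = theta + sigma * np.random.randn(n_samples, theta.size)
        costs = np.array([rollout_cost(s) for s in samples])
        # Map costs to positive weights; lower cost -> higher weight.
        c = (costs - costs.min()) / (costs.max() - costs.min() + 1e-10)
        weights = np.exp(-h * c)
        weights /= weights.sum()
        # Reward-weighted averaging replaces gradient estimation.
        theta = weights @ samples
    return theta

theta_opt = reward_weighted_averaging(np.array([2.0, -1.5, 0.5]))
print(theta_opt)
```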

Cited by 101 publications (99 citation statements)
References 25 publications
“…the known analytic objectives to optimize the non-stationary policy. In our experiments we compare to CMA, which has been shown to be closely related to PI² [23].…”
Section: Related Work, A. Policy Search in Robotics
Mentioning, confidence: 99%
“…CMA has been used previously to learn robot skills [4,20,23]. We applied both methods on the same return function R(θ) over 100 iterations.…”
Section: B. Opening a Door with a PR2
Mentioning, confidence: 99%
“…The optimization algorithm we use is PI^BB, short for “Policy Improvement with Black-Box optimization” [2]. The PI^BB algorithm is explained and visualized in Fig.…”
Section: BB
Mentioning, confidence: 99%
“…If we ignore the costs at individual time steps r_t, and only use the return of an episode R = Σ_{t=1}^{T} r_t, policy improvement is equivalent to black-box optimization [2], where the black-box cost function J : Θ → ℝ takes θ as an input and returns the scalar return of the episode R, as in (1). Each evaluation of J thus corresponds to one episode, or rollout.…”
Section: Formalization
Mentioning, confidence: 99%
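To make that black-box reading concrete, here is a minimal sketch in which an episode's scalar return is wrapped as a cost function J(θ) and handed to a simple (1+1) evolution strategy. The quadratic J, the step-size rule, and all constants are illustrative assumptions; this is not the PI^BB algorithm from the cited work, only an example of treating policy improvement as black-box optimization.

```python
import numpy as np

# Hypothetical black-box cost J: Theta -> R. Each call stands in for one full
# episode (rollout) whose scalar return is all the optimizer sees; the
# per-time-step costs r_t are never exposed.
def J(theta):
    return float(np.sum((theta - 1.0) ** 2))

# A minimal (1+1) evolution strategy over J: once the problem is phrased this
# way, any black-box optimizer can perform policy improvement.
theta = np.zeros(3)
best = J(theta)
sigma = 0.5
for _ in range(200):
    candidate = theta + sigma * np.random.randn(theta.size)  # one new rollout
    cost = J(candidate)
    if cost < best:                     # keep the candidate only if it improves
        theta, best = candidate, cost
        sigma *= 1.1                    # simplified step-size adaptation
    else:
        sigma *= 0.97
print(theta, best)
```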