2018 IEEE International Conference on Robotics and Automation (ICRA)
DOI: 10.1109/icra.2018.8460756
Composable Deep Reinforcement Learning for Robotic Manipulation

Abstract: Model-free deep reinforcement learning has been shown to exhibit good performance in domains ranging from video games to simulated robotic manipulation and locomotion. However, model-free methods are known to perform poorly when the interaction time with the environment is limited, as is the case for most real-world robotic tasks. In this paper, we study how maximum entropy policies trained using soft Q-learning can be applied to real-world robotic manipulation. The application of this method to real-world man…

Cited by 778 publications (1,252 citation statements)
References 28 publications
“…which incentivizes the policy to explore more widely, improving its robustness against perturbations [16]. The temperature parameter α determines the relative importance of the entropy term against the reward, and thus controls the stochasticity of the optimal policy.…”
Section: Reinforcement Learning Preliminaries
confidence: 99%
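For context, the entropy-regularized objective this statement refers to is conventionally written as follows (standard maximum entropy RL notation; the symbols below are not quoted from the report):

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
\left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
```

Setting α → 0 recovers the conventional expected-return objective, while larger α rewards more stochastic, exploratory policies.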
See 1 more Smart Citation
“…which incentivizes the policy to explore more widely improving its robustness against perturbations [16]. The temperature parameter α determines the relative importance of the entropy term against the reward, and thus controls the stochasticity of the optimal policy.…”
Section: Reinforcement Learning Preliminariesmentioning
confidence: 99%
“…Learning robotic tasks in the real world requires an algorithm that is sample efficient, robust, and insensitive to the choice of hyperparameters. Maximum entropy RL is both sample efficient and robust, making it a good candidate for real-world robot learning [16]. However, one of the major challenges of maximum entropy RL is its sensitivity to the temperature parameter, which typically needs to be tuned for each task separately.…”
Section: Automating Entropy Adjustment for Maximum Entropy RL
confidence: 99%
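The automatic temperature adjustment this statement alludes to is typically implemented as a small gradient step on α that drives the policy's entropy toward a target value. A minimal sketch, assuming a PyTorch-style setup (names such as `log_alpha` and `target_entropy` are illustrative, not taken from the report):

```python
import torch

# Learn log(alpha) rather than alpha itself so the temperature stays positive.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

# A common heuristic target entropy: -|A|, the negative action dimensionality.
action_dim = 6
target_entropy = -float(action_dim)

def update_temperature(log_prob: torch.Tensor) -> float:
    """One gradient step on alpha given log-probs of sampled actions.

    The loss raises alpha when the policy's entropy (-log_prob) falls
    below the target and lowers it when the entropy exceeds the target.
    """
    alpha_loss = -(log_alpha.exp() * (log_prob + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()

# Usage with a dummy batch of action log-probabilities:
new_alpha = update_temperature(torch.randn(256))
```

This removes α from the set of per-task hyperparameters at the cost of choosing a target entropy, which is usually set by the simple dimensionality heuristic above.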
“…Recent work in RL for manipulation has tended to take a more tabula rasa approach, focusing on learning policies that output joint torques directly or that output position (and velocity) references to an underlying PD controller. Direct torque control has been used to learn many physical and simulated tasks, including peg insertion, placing a coat hanger, hammering, screwing a bottle cap [6], door opening, pick and place tasks [5], and Lego stacking tasks [20]. Learning position and/or velocity references to a fixed PD joint controller has been used for tasks such as door opening, hammering, object placement [21], Lego stacking [7], and in-hand manipulation [1].…”
Section: Introduction
confidence: 99%
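To make the second interface concrete: a policy that outputs position (and velocity) references delegates torque generation to a low-level PD loop. A minimal sketch of such a joint-space PD controller (the gains and names are illustrative assumptions, not taken from the cited works):

```python
import numpy as np

def pd_torque(q_ref, qd_ref, q, qd, kp=100.0, kd=10.0):
    """Joint torques tracking a position/velocity reference, per joint:
    tau = Kp * (q_ref - q) + Kd * (qd_ref - qd)."""
    q_ref, qd_ref, q, qd = map(np.asarray, (q_ref, qd_ref, q, qd))
    return kp * (q_ref - q) + kd * (qd_ref - qd)

# Example: a 6-DoF arm being driven back to a commanded posture.
tau = pd_torque(q_ref=np.zeros(6), qd_ref=np.zeros(6),
                q=np.full(6, 0.1), qd=np.zeros(6))
```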
“…It is also possible to incorporate optimization layers (e.g., a QP program [144]) into a neural network in order to take advantage of the structure they provide. Lastly, one can learn distinct soft policies for simpler tasks and then compose them in order to achieve a more complicated task [65].…”
Section: Generalization and Robustness
confidence: 99%
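The composition result referenced in [65] (the paper under review) is that maximum entropy policies trained for individual tasks can be approximately combined by averaging their soft Q-functions and acting softmax-greedily with respect to the average. A minimal sketch over a discrete action set (the function name is illustrative; the paper itself works with continuous actions via soft Q-learning):

```python
import numpy as np

def compose_soft_policy(q_values_per_task, alpha=1.0):
    """Approximate composed policy from per-task soft Q-values.

    q_values_per_task: shape (num_tasks, num_actions), holding Q_i(s, a)
    for a fixed state s. The composed policy is
    pi_C(a|s) ∝ exp(mean_i Q_i(s, a) / alpha).
    """
    q = np.asarray(q_values_per_task, dtype=float)
    logits = q.mean(axis=0) / alpha
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Example: composing two task-specific Q-functions over three actions.
pi = compose_soft_policy([[1.0, 0.5, 0.0],
                          [0.2, 0.9, 0.1]])
```

Intuitively, actions that score well under every constituent task keep high probability under the composed policy, which is what makes the averaged soft Q-function a reasonable approximation for the intersection of the tasks.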