2018
DOI: 10.48550/arxiv.1805.10413
Preprint

Fast Policy Learning through Imitation and Reinforcement

Abstract: Imitation learning (IL) consists of a set of tools that leverage expert demonstrations to quickly learn policies. However, if the expert is suboptimal, IL can yield policies with inferior performance compared to reinforcement learning (RL). In this paper, we aim to provide an algorithm that combines the best aspects of RL and IL. We accomplish this by formulating several popular RL and IL algorithms in a common mirror descent framework, showing that these algorithms can be viewed as a variation on a single app…
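As a rough illustration of the "common mirror descent framework" the abstract refers to (the paper's exact formulation may differ), a generic mirror descent update over policies is

$$\pi_{t+1} \;=\; \arg\min_{\pi \in \Pi}\ \eta_t\,\langle g_t, \pi\rangle + D_R(\pi \,\|\, \pi_t),$$

where $g_t$ is the first-order feedback at round $t$ (e.g., a policy-gradient estimate in RL, or the gradient of an imitation loss toward the expert in IL), $\eta_t$ is a step size, and $D_R$ is the Bregman divergence of a strongly convex regularizer $R$ (the KL divergence when $R$ is the negative entropy). Different choices of $g_t$ and $R$ recover different RL and IL updates.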

Cited by 19 publications (23 citation statements). References 26 publications.
“…This can be improved in the future by considering more general function approximators such as Graph Neural Networks [39] to represent the selector policy. Second, while imitation learning of clairvoyant oracles is effective, the approach may be further improved through reinforcement learning [34,35], since in practice we do not use the exact oracle but a sub-optimal approximation, which means that errors in the oracle will transfer to the learner, limiting performance.…”
Section: Discussion
confidence: 99%
“…THOR [34] performs a multi-step search to gain advantage over the reference policy. LOKI [35] switches from IL to RL. Imitation of clairvoyant oracles has been used in multiple domains like information gathering [7], heuristic search [36], and MPC [37,38].…”
Section: Related Work
confidence: 99%
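To make the one-line description above concrete ("LOKI [35] switches from IL to RL"), here is a minimal sketch of that switching scheme. The toy task, suboptimal expert, linear-Gaussian policy, and hyperparameters below are hypothetical stand-ins, not the authors' implementation: the policy is first fit by imitation for a small, randomly chosen number of iterations, then improved with a policy-gradient (REINFORCE) phase on the true reward.

```python
# Minimal sketch (assumptions, not the authors' code) of the "switch from IL
# to RL" idea: warm-start a policy by imitating a suboptimal expert for a
# small random number of iterations, then continue with policy-gradient RL.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: state s ~ N(0, 1); reward peaks when action = 2 * s.
def sample_state():
    return rng.normal()

def reward(s, a):
    return -(a - 2.0 * s) ** 2

def expert_action(s):
    # Suboptimal expert: right direction but wrong gain (1.5 instead of 2.0).
    return 1.5 * s

theta, sigma = 0.0, 0.5  # linear-Gaussian policy: a ~ N(theta * s, sigma^2)

# Phase 1: imitation learning -- squared-error regression onto the expert's
# actions, run for a small, randomly chosen number of iterations.
n_il = int(rng.integers(10, 50))
for _ in range(n_il):
    s = sample_state()
    theta += 0.05 * (expert_action(s) - theta * s) * s  # step on (a* - theta*s)^2

# Phase 2: reinforcement learning -- REINFORCE on the true reward, which lets
# the learner improve past the suboptimal expert it imitated.
for _ in range(5000):
    s = sample_state()
    a = theta * s + sigma * rng.normal()
    # Likelihood-ratio gradient: reward * d/d(theta) log N(a; theta*s, sigma^2).
    grad = reward(s, a) * (a - theta * s) * s / sigma ** 2
    theta += 0.01 * grad

print(f"IL iterations: {n_il}, learned gain: {theta:.2f} (expert 1.5, optimal 2.0)")
```

In this sketch the imitation phase moves the policy quickly toward the expert's behavior, and the RL phase then pushes the gain past 1.5 toward the optimal 2.0, mirroring the motivation of combining IL's fast start with RL's ability to exceed a suboptimal expert.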
“…Many approaches investigate the incorporation of human-provided demonstrations into policy search to drastically reduce sample complexity via a reasonable initial policy and/or the integration of demonstrations in the learning objective [32,18,6,33,42,41,23].…”
Section: Policy Search
confidence: 99%
“…Supervised approaches for policy learning like Learning From Demonstration (LfD) [2] can encode human prior knowledge by imitating expert examples, but do not support optimization in new environments. Combining RL with LfD is a powerful method for reducing the sample complexity of policy search, and is often used in practice [23,33,42,6]. However, …”
Section: Introduction
confidence: 99%
“…Batch reinforcement learning in both the tabular and function-approximator settings has long been studied (Lange et al., 2012; Strehl et al., 2010) and continues to be a highly active area of research (Swaminathan & Joachims, 2015; Jiang & Li, 2015; Thomas & Brunskill, 2016; Farajtabar et al., 2018; Irpan et al., 2019; Jaques et al., 2019). Imitation learning is also a well-studied problem (Schaal, 1999; Argall et al., 2009; Hussein et al., 2017) that likewise remains a highly active area of research (Kim et al., 2013; Piot et al., 2014; Chemali & Lazaric, 2015; Hester et al., 2018; Ho et al., 2016; Sun et al., 2017; Cheng et al., 2018; Gao et al., 2018). This paper relates most closely to Fujimoto et al. (2018a), which made the critical observation that when conventional DQL-based algorithms are employed for batch reinforcement learning, performance can be very poor, with the algorithm possibly not learning at all.…”
Section: Related Work
confidence: 99%