2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
DOI: 10.1109/iros.2015.7354297
Learning compound multi-step controllers under unknown dynamics

Cited by 33 publications (45 citation statements)
References 13 publications
“…Baselines. We evaluate forward-backward RL (FBRL) (Han et al., 2015; Eysenbach et al., 2017), a perturbation controller (R3L) (Zhu et al., 2020), value-accelerated persistent RL (VaPRL) (Sharma et al., 2021), a comparison to simply running the base RL algorithm with the biased TD update discussed in Section 6.1 (naïve RL), and finally an oracle (oracle RL) where resets are provided every H_E steps (H_T is typically three orders of magnitude larger than H_E). We benchmark VaPRL only when demonstrations are available, in accordance with the algorithm proposed in Sharma et al. (2021).…”
Section: Evaluation: Setup, Metrics, Baselines, and Results
confidence: 99%
“…Reset-free RL has been studied by previous works with a focus on safety (Eysenbach et al., 2017), automated and unattended learning in the real world (Han et al., 2015; Zhu et al., 2020), skill discovery (Lu et al., 2020), and providing a curriculum (Sharma et al., 2021). Strategies to learn reset-free behavior include directly learning a backward reset controller (Eysenbach et al., 2017), learning a set of auxiliary tasks that can serve as an approximate reset (Ha et al., 2020), or using a novelty-seeking reset controller (Zhu et al., 2020).…”
Section: Related Work
confidence: 99%
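The forward-backward scheme mentioned in this excerpt can be sketched in a toy setting: a forward policy drives the agent toward the goal, and a backward policy returns it to the start state in place of a manual reset, so training can continue unattended. Everything below (the `ToyChain` environment, the greedy stand-in policies) is illustrative and not taken from the cited papers.

```python
class ToyChain:
    """1-D chain: states 0..n, agent starts at 0, goal is state n."""
    def __init__(self, n=5):
        self.n = n
        self.state = 0

    def step(self, action):  # action in {-1, +1}
        self.state = min(max(self.state + action, 0), self.n)
        return self.state


def greedy_policy(env, target):
    """Stand-in 'policy' that just moves one step toward a target state."""
    return 1 if env.state < target else -1


def forward_backward_episode(env, horizon=20):
    """One forward phase (reach the goal), then one backward phase
    (return to the start), standing in for an external reset."""
    for _ in range(horizon):          # forward phase
        if env.state == env.n:
            break
        env.step(greedy_policy(env, env.n))
    reached_goal = env.state == env.n
    for _ in range(horizon):          # backward phase: learned "reset"
        if env.state == 0:
            break
        env.step(greedy_policy(env, 0))
    return reached_goal, env.state


env = ToyChain(n=5)
for episode in range(3):              # no external resets between episodes
    reached, end_state = forward_backward_episode(env)
    print(reached, end_state)         # each episode ends back at the start
```

In the actual methods both phases are learned controllers trained with RL; here greedy policies stand in so the control flow (alternating forward attempts with backward resets) is visible in a few lines.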
“…Reinforcement Learning (RL) is increasingly popular in robotics as it facilitates learning control policies through exploration [17,24,36,11,35,40,9,37,23,10]. However, it is well known that the efficacy of RL algorithms depends heavily on how reward functions are specified [32].…”
Section: Introduction
confidence: 99%