Batch Exploration With Examples for Scalable Robotic Reinforcement Learning

Chen, Annie S.; Nam, Hyunji Alex; Nair, Suraj; Finn, Chelsea

doi:10.1109/lra.2021.3068655

Cited by 16 publications

(19 citation statements)

References 33 publications

“…For adversarial learning techniques, we use gradient penalty (GP) to avoid over-fitting the discriminator. Other options for discriminator regularization techniques include spectral normalization [36], Mixup [37,38], and PUGAIL [39], however we chose GP as it has been empirically shown to achieve decent performance across multiple tasks [40,41].…”

Section: Methods (mentioning)

confidence: 99%

OPIRL: Sample Efficient Off-Policy Inverse Reinforcement Learning via Distribution Matching

Hoshino, Ota, Kanezaki et al. 2021

Preprint

Inverse Reinforcement Learning (IRL) is attractive in scenarios where reward engineering can be tedious. However, prior IRL algorithms use on-policy transitions, which require intensive sampling from the current policy for stable and optimal performance. This limits IRL applications in the real world, where environment interactions can become highly expensive. To tackle this problem, we present Off-Policy Inverse Reinforcement Learning (OPIRL), which (1) adopts off-policy data distribution instead of on-policy and enables significant reduction of the number of interactions with the environment, (2) learns a stationary reward function that is transferable with high generalization capabilities on changing dynamics, and (3) leverages mode-covering behavior for faster convergence. We demonstrate that our method is considerably more sample efficient and generalizes to novel environments through the experiments. Our method achieves better or comparable results on policy performance baselines with significantly fewer interactions. Furthermore, we empirically show that the recovered reward function generalizes to different tasks where prior arts are prone to fail.

“…Much like our work, a number of prior works have studied how learning from broad datasets can enhance generalization in robot learning [16,33,56,13,22,24,10,5]. These works have largely studied the problem of collecting large and diverse robotic datasets in scalable ways [28,22,10,53,7] as well as techniques for learning general purpose policies from this style of data in an offline [13,5] or online [33,29,24] fashion. While our motivation of achieving generalization by learning from diverse data heavily overlaps with the above works, our approach fundamentally differs in that it aims to sidestep the challenges associated with collecting diverse robotic data by instead leveraging existing human data sources.…”

Section: Robotic Learning From Large Datasets (mentioning)

confidence: 99%

Learning Generalizable Robotic Reward Functions from “In-The-Wild” Human Videos

Chen, Nair, Finn 2021

Robotics: Science and Systems XVII

Self Cite

We are motivated by the goal of generalist robots that can complete a wide range of tasks across many environments. Critical to this is the robot's ability to acquire some metric of task success or reward, which is necessary for reinforcement learning, planning, or knowing when to ask for help. For a general-purpose robot operating in the real world, this reward function must also be able to generalize broadly across environments, tasks, and objects, while depending only on on-board sensor observations (e.g. RGB images). While deep learning on large and diverse datasets has shown promise as a path towards such generalization in computer vision and natural language, collecting high quality datasets of robotic interaction at scale remains an open challenge. In contrast, "in-the-wild" videos of humans (e.g. YouTube) contain an extensive collection of people doing interesting tasks across a diverse range of settings. In this work, we propose a simple approach, Domain-agnostic Video Discriminator (DVD), that learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task, and can generalize by virtue of learning from a small amount of robot data with a broad dataset of human videos. We find that by leveraging diverse human datasets, this reward function (a) can generalize zero shot to unseen environments, (b) generalize zero shot to unseen tasks, and (c) can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.

“…Alternatively (Lee et al, 2020; Wayne et al, 2018) train a variational latent space model, but use it only as a filter and train a separate policy on top of the learned latent representation. Model-based RL learns a dynamics model either in the pixel space (Finn and Levine, 2017; Ebert et al, 2018) or in a latent space (Watter et al, 2015; Banijamali et al, 2018; Hafner et al, 2019; Ha and Schmidhuber, 2018; Kipf et al, 2019; Chen et al, 2020) and can either learn a policy within the model or deploy shooting-based planning methods. However, most of those prior works rely critically on online data collection to be successful.…”

Section: Related Work (mentioning)

confidence: 99%

“…However, most of those prior works rely critically on online data collection to be successful. Visual foresight algorithms (Finn and Levine, 2017; Ebert et al, 2018; Suh and Tedrake, 2020; Yen-Chen et al, 2019; Chen et al, 2020) handle control from pixels in a fully offline setting, but do not explicitly tackle the distributional shift issue that arises; meanwhile, our method is designed to specifically address this. As a result, we find in Section 5.2 that our approach significantly outperforms visual foresight.…”

Section: Related Work (mentioning)

confidence: 99%

Offline Reinforcement Learning from Images with Latent Space Models

Rafailov, Yu, Rajeswaran et al. 2020

Preprint

Self Cite

Offline reinforcement learning (RL) refers to the problem of learning policies from a static dataset of environment interactions. Offline RL enables extensive use and re-use of historical datasets, while also alleviating safety concerns associated with online exploration, thereby expanding the real-world applicability of RL. Most prior work in offline RL has focused on tasks with compact state representations. However, the ability to learn directly from rich observation spaces like images is critical for real-world applications such as robotics. In this work, we build on recent advances in model-based algorithms for offline RL, and extend them to high-dimensional visual observation spaces. Model-based offline RL algorithms have achieved state of the art results in state based tasks and have strong theoretical guarantees. However, they rely crucially on the ability to quantify uncertainty in the model predictions, which is particularly challenging with image observations. To overcome this challenge, we propose to learn a latent-state dynamics model, and represent the uncertainty in the latent space. Our approach is both tractable in practice and corresponds to maximizing a lower bound of the ELBO in the unknown POMDP. In experiments on a range of challenging image-based locomotion and manipulation tasks, we find that our algorithm significantly outperforms previous offline model-free RL methods as well as state-of-the-art online visual model-based RL methods. Moreover, we also find that our approach excels on an image-based drawer closing task on a real robot using a pre-existing dataset. All results including videos can be found online at https://sites.google.com/view/lompo/.
