Massive datasets have proven critical to successfully applying deep learning to real-world problems, catalyzing progress on tasks such as object recognition, speech transcription, and machine translation. In this work, we study an analogous problem within reinforcement learning: can we enable an agent to leverage large, diverse experiences from previous tasks in order to quickly learn a new task? While recent work has shown some promise towards offline reinforcement learning, considerably less work has studied how we might leverage offline behavioral data when transferring to new tasks. To address this gap, we consider the problem setting of offline meta-reinforcement learning. By nature of being offline, algorithms for offline meta-RL can utilize the largest possible pool of training data available and eliminate potentially unsafe or costly data collection during meta-training. Targeting this setting, we propose Meta-Actor Critic with Advantage Weighting (MACAW), an optimization-based meta-learning algorithm that uses simple, supervised regression objectives for both inner-loop adaptation and outer-loop meta-learning. To our knowledge, MACAW is the first successful combination of gradient-based meta-learning and value-based reinforcement learning. We empirically find that this approach enables fully offline meta-reinforcement learning and achieves notable gains over prior methods in some settings.
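The adaptation mechanism named in the abstract, advantage-weighted regression, reduces policy improvement to a supervised objective: weight a behavior-cloning log-likelihood by exponentiated advantages. Below is a minimal PyTorch sketch of one such inner-loop adaptation step; the network layout, temperature beta, weight clamp, and all names (PolicyNet, awr_loss) are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of an advantage-weighted regression (AWR) adaptation step,
# the kind of supervised objective MACAW uses in its inner loop.
# Architecture and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Gaussian policy with a learnable state-independent log-std."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, act_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def log_prob(self, obs, act):
        dist = torch.distributions.Normal(self.mean(obs), self.log_std.exp())
        return dist.log_prob(act).sum(-1)

def awr_loss(policy, obs, act, advantages, beta: float = 1.0):
    # Weight the behavior-cloning log-likelihood by exponentiated
    # advantages, clamped for numerical stability.
    weights = torch.clamp(torch.exp(advantages / beta), max=20.0)
    return -(weights.detach() * policy.log_prob(obs, act)).mean()

# One inner-loop adaptation step on a batch of offline experience:
policy = PolicyNet(obs_dim=4, act_dim=2)
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)
obs, act, adv = torch.randn(32, 4), torch.randn(32, 2), torch.randn(32)
opt.zero_grad(); awr_loss(policy, obs, act, adv).backward(); opt.step()
```

Because this is a plain regression loss, the same objective can serve both the inner adaptation step and the outer meta-update, which is the simplicity the abstract highlights.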
Reward function specification, which requires considerable human effort and iteration, remains a major impediment for learning behaviors through deep reinforcement learning. In contrast, providing visual demonstrations of desired behaviors often presents an easier and more natural way to teach agents. We consider a setting where an agent is provided a fixed dataset of visual demonstrations illustrating how to perform a task, and must learn to solve the task using the provided demonstrations and unsupervised environment interactions. This setting presents a number of challenges, including representation learning for visual observations, sample complexity due to high-dimensional spaces, and learning instability due to the lack of a fixed reward or learning signal. Towards addressing these challenges, we develop a variational model-based adversarial imitation learning (V-MAIL) algorithm. The model-based approach provides a strong signal for representation learning, enables sample efficiency, and improves the stability of adversarial training by enabling on-policy learning. Through experiments involving several vision-based locomotion and manipulation tasks, we find that V-MAIL learns successful visuomotor policies in a sample-efficient manner, has better stability compared to prior work, and also achieves higher asymptotic performance. We further find that by transferring the learned models, V-MAIL can learn new tasks from visual demonstrations without any additional environment interactions. All results including videos can be found online at https://sites.google.com/view/variational-mail.
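To make the adversarial imitation signal concrete, here is a minimal PyTorch sketch of a GAIL-style discriminator operating on latent features, whose output serves as a learned reward for on-policy rollouts inside a learned model. The latent dimensionality, architecture, and names are hypothetical stand-ins, not the paper's exact V-MAIL components.

```python
# Sketch of an adversarial imitation signal over latent features, assuming
# latents come from some learned world-model encoder (not shown here).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Scores latent features: high logits mean 'looks like expert data'."""
    def __init__(self, latent_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, z):
        return self.net(z).squeeze(-1)

def discriminator_loss(disc, z_expert, z_policy):
    # Standard binary cross-entropy GAN objective over latent features.
    real = F.binary_cross_entropy_with_logits(
        disc(z_expert), torch.ones(z_expert.shape[0]))
    fake = F.binary_cross_entropy_with_logits(
        disc(z_policy), torch.zeros(z_policy.shape[0]))
    return real + fake

def imitation_reward(disc, z):
    # Non-saturating learned reward, -log(1 - D(z)) in probability space:
    # it grows as policy latents become indistinguishable from the expert's.
    return -F.logsigmoid(-disc(z))

# Hypothetical usage with pre-computed latent features:
disc = Discriminator(latent_dim=32)
z_expert, z_policy = torch.randn(64, 32), torch.randn(64, 32)
loss = discriminator_loss(disc, z_expert, z_policy)
rewards = imitation_reward(disc, z_policy)  # per-step reward for RL
```

Training the policy against this reward on imagined, on-policy rollouts from the model, rather than stale replay data, is what the abstract credits for the improved stability of adversarial training.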
We study how the choice of visual perspective affects learning and generalization in the context of physical manipulation from raw sensor observations. Compared with the more commonly used global third-person perspective, a hand-centric (eye-in-hand) perspective affords reduced observability, but we find that it consistently improves training efficiency and out-of-distribution generalization. These benefits hold across a variety of learning algorithms, experimental settings, and distribution shifts, and for both simulated and real robot apparatuses. However, this is only the case when hand-centric observability is sufficient; otherwise, including a third-person perspective is necessary for learning, but also harms out-of-distribution generalization. To mitigate this, we propose to regularize the third-person information stream via a variational information bottleneck. On six representative manipulation tasks with varying hand-centric observability adapted from the Meta-World benchmark, this results in a state-of-the-art reinforcement learning agent operating from both perspectives that improves its out-of-distribution generalization on every task. While some practitioners have long put cameras in the hands of robots, our work systematically analyzes the benefits of doing so and provides simple and broadly applicable insights for improving end-to-end learned vision-based robotic manipulation.

Figure 1 (caption): Illustration suggesting the role that visual perspective can play in facilitating the acquisition of symmetries with respect to certain transformations on the world state s. T_0: planar translation of the end-effector and cube. T_1: vertical translation of the table surface, end-effector, and cube. T_2: addition of distractor objects. O^3: third-person perspective. O^h: hand-centric perspective.
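As a concrete illustration of the proposed regularizer, the following PyTorch sketch applies a variational information bottleneck to the third-person feature stream while passing the hand-centric stream through unmodified. The encoder layout, latent size, and KL weight are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch of a variational information bottleneck (VIB) on one observation
# stream: encode third-person features as a Gaussian latent and penalize
# KL divergence to a unit prior, limiting how much that stream can carry.
import torch
import torch.nn as nn

class VIBEncoder(nn.Module):
    def __init__(self, feat_dim: int, latent_dim: int):
        super().__init__()
        self.mu = nn.Linear(feat_dim, latent_dim)
        self.log_var = nn.Linear(feat_dim, latent_dim)

    def forward(self, features):
        mu, log_var = self.mu(features), self.log_var(features)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterize
        # KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch.
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(-1).mean()
        return z, kl

# Hypothetical usage: bottleneck only the third-person stream.
third_person_feats = torch.randn(32, 256)  # e.g. from a CNN encoder
hand_centric_feats = torch.randn(32, 256)
vib = VIBEncoder(256, 64)
z3, kl = vib(third_person_feats)
policy_input = torch.cat([hand_centric_feats, z3], dim=-1)
aux_loss = 1e-3 * kl  # beta-weighted KL added to the RL objective
```

The asymmetry is the point: the agent pays an information cost only for what it extracts from the third-person view, so it learns to rely on that stream just enough to resolve what the hand-centric view cannot see.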