Unsupervised Control Through Non-Parametric Discriminative Rewards

Warde-Farley, David; Wiele, Tom Van de; Kulkarni, Tejas; Ionescu, Catalin; Hansen, S. M.; Mnih, Volodymyr

doi:10.48550/arxiv.1811.11359

Cited by 20 publications

(31 citation statements)

References 18 publications


“…Our method learns distances between full image states (3072-dimensional) while HER uses 3-dimensional goals, a difference of two orders of magnitude in dimensionality. This difficulty of learning complex image-based goals is further corroborated in prior work (Nair et al, 2018; Pong et al, 2019; Warde-Farley et al, 2018).…”

Section: Vision-based Real-world Manipulation From Human Preferences (supporting)

confidence: 66%

“…Our method is also well suited for fully unsupervised learning, in which case DDL uses the distance function to propose goals for unsupervised skill discovery. Prior work on unsupervised reinforcement learning has proposed choosing goals based on a variety of unsupervised criteria, typically with the aim of attaining broad state coverage (Nair et al, 2018; Florensa et al, 2018; Eysenbach et al, 2018; Warde-Farley et al, 2018; Pong et al, 2019). Our method instead repeatedly chooses the most distant state as the goal, which produces rapid exploration and quickly discovers relatively complex skills.…”

Section: Related Work (mentioning)

confidence: 99%


Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery

Hartikainen, Geng, Haarnoja et al. 2019

Preprint

Reinforcement learning requires manual specification of a reward function to learn a task. While in principle this reward function only needs to specify the task goal, in practice reinforcement learning can be very time-consuming or even infeasible unless the reward function is shaped so as to provide a smooth gradient towards a successful outcome. This shaping is difficult to specify by hand, particularly when the task is learned from raw observations, such as images. In this paper, we study how we can automatically learn dynamical distances: a measure of the expected number of time steps to reach a given goal state from any other state. These dynamical distances can be used to provide well-shaped reward functions for reaching new goals, making it possible to learn complex tasks efficiently. We show that dynamical distances can be used in a semi-supervised regime, where unsupervised interaction with the environment is used to learn the dynamical distances, while a small amount of preference supervision is used to determine the task goal, without any manually engineered reward function or goal examples. We evaluate our method both on a real-world robot and in simulation. We show that our method can learn to turn a valve with a real-world 9-DoF hand, using raw image observations and just ten preference labels, without any other supervision. Videos of the learned skills can be found on the project website: https://sites.google.com/view/dynamical-distance-learning.
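The abstract above defines a dynamical distance as the expected number of time steps needed to reach a goal state from another state, and the quoted citation statement mentions choosing the most distant state as the next goal. The following is only a minimal tabular sketch of that idea, not the authors' implementation: the paper learns distances from image observations, whereas this example assumes small, hashable states and simply averages observed step counts. All function names here are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

State = Hashable


def distance_targets(trajectory: Sequence[State]) -> List[Tuple[State, State, int]]:
    """Return (state, later_state, steps_between) for every ordered pair in one trajectory."""
    return [
        (trajectory[i], trajectory[j], j - i)
        for i in range(len(trajectory))
        for j in range(i, len(trajectory))
    ]


def tabular_distance(trajectories: Sequence[Sequence[State]]) -> Dict[Tuple[State, State], float]:
    """Estimate distances by averaging observed step counts per (state, goal) pair.

    Assumption: a lookup table stands in for the learned distance function.
    """
    sums: Dict[Tuple[State, State], float] = {}
    counts: Dict[Tuple[State, State], int] = {}
    for traj in trajectories:
        for s, g, d in distance_targets(traj):
            sums[(s, g)] = sums.get((s, g), 0.0) + d
            counts[(s, g)] = counts.get((s, g), 0) + 1
    return {pair: sums[pair] / counts[pair] for pair in sums}


def most_distant_goal(dist: Dict[Tuple[State, State], float], start: State) -> State:
    """Pick the reachable state with the largest estimated distance from `start`."""
    candidates = {g: d for (s, g), d in dist.items() if s == start}
    return max(candidates, key=candidates.get)


# Two toy trajectories over integer states (illustrative data only).
trajectories = [[0, 1, 2, 3], [0, 2, 3]]
dist = tabular_distance(trajectories)
goal = most_distant_goal(dist, start=0)
```

In this toy data, state 3 is reached from state 0 in 3 steps on one trajectory and 2 steps on the other, so its estimated distance is the average of the two, and it is selected as the most distant goal.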


“…With the novelty and potential measures, we develop a subgoal selection strategy to improve exploration in HRL. There are numerous goal selection strategies in the multi-goal RL domain [42] as well, including sampling diverse goals uniformly from a buffer [43,44], sampling goals from the achieved goal distribution [45], sampling goals of intermediate difficulty from a generative model [46], and selecting goals in sparsely explored areas [47]. However, those methods use predefined or pretrained goal spaces.…”

Section: Related Work (mentioning)

confidence: 99%

Active Hierarchical Exploration with Stable Subgoal Representation Learning

Liu, Zhang, Wang et al. 2021

Preprint

Goal-conditioned hierarchical reinforcement learning (HRL) serves as a successful approach to solving complex and temporally extended tasks. Recently, its success has been extended to more general settings by concurrently learning hierarchical policies and subgoal representations. However, online subgoal representation learning exacerbates the non-stationary issue of HRL and introduces challenges for exploration in high-level policy learning. In this paper, we propose a state-specific regularization that stabilizes subgoal embeddings in well-explored areas while allowing representation updates in less explored state regions. Benefiting from this stable representation, we design measures of novelty and potential for subgoals, and develop an efficient hierarchical exploration strategy that actively seeks out new promising subgoals and states. Experimental results show that our method significantly outperforms state-of-the-art baselines in continuous control tasks with sparse rewards and further demonstrate the stability and efficiency of the subgoal representation learning of this work, which promotes superior policy learning.


“…A large body of work focusing on online skill discovery have been proposed as a means to improve exploration and sample complexity in online RL. For instance, Eysenbach et al (2018); Sharma et al (2019); Gregor et al (2016); Warde-Farley et al (2018); Liu et al (2021) propose to learn a diverse set of skills by maximizing an information theoretic objective. Online skill discovery is also commonly seen in a hierarchical framework that learns a continuous space (Vezhnevets et al, 2017; Hausman et al, 2018; Nachum et al, 2018a) or a discrete set of lower-level policies (Bacon et al, 2017; Stolle & Precup, 2002; Peng et al, 2019), upon which higher-level policies are trained to solve specific tasks.…”

Section: Related Work (mentioning)

confidence: 99%

TRAIL: Near-Optimal Imitation Learning with Suboptimal Data

Yang, Levine, Nachum 2021

Preprint

The aim in imitation learning is to learn effective policies by utilizing near-optimal expert demonstrations. However, high-quality demonstrations from human experts can be expensive to obtain in large number. On the other hand, it is often much easier to obtain large quantities of suboptimal or task-agnostic trajectories, which are not useful for direct imitation, but can nevertheless provide insight into the dynamical structure of the environment, showing what could be done in the environment even if not what should be done. We ask the question, is it possible to utilize such suboptimal offline datasets to facilitate provably improved downstream imitation learning? In this work, we answer this question affirmatively and present training objectives that use offline datasets to learn a factored transition model whose structure enables the extraction of a latent action space. Our theoretical analysis shows that the learned latent action space can boost the sample-efficiency of downstream imitation learning, effectively reducing the need for large near-optimal expert datasets through the use of auxiliary non-expert data. To learn the latent action space in practice, we propose TRAIL (Transition-Reparametrized Actions for Imitation Learning), an algorithm that learns an energy-based transition model contrastively, and uses the transition model to reparametrize the action space for sample-efficient imitation learning. We evaluate the practicality of our objective through experiments on a set of navigation and locomotion tasks. Our results verify the benefits suggested by our theory and show that TRAIL is able to improve baseline imitation learning by up to 4x in performance.
