Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning

Achiam, Joshua; Sastry, S. Shankar

doi:10.48550/arxiv.1703.01732

Cited by 60 publications

(107 citation statements)

References 4 publications

Supporting

Mentioning

107

Contrasting

Order By: Relevance

“…The utilisation of forward modelling error as reward signal has been implemented in deterministic and probabilistic settings (Achiam and Sastry, 2017;Shelhamer et al, 2016) and is commonly known as curiosity learning. The error reward signal encourages the exploration of unfamiliar parts of the state-action space which are not yet wellpredictable.…”

Section: Related Workmentioning

confidence: 99%

Is Curiosity All You Need? On the Utility of Emergent Behaviours from Curious Exploration

Groth,

Wulfmeier,

Vezzani

et al. 2021

Preprint

View full text Add to dashboard Cite

Curiosity-based reward schemes can present powerful exploration mechanisms which facilitate the discovery of solutions for complex, sparse or long-horizon tasks. However, as the agent learns to reach previously unexplored spaces and the objective adapts to reward new areas, many behaviours emerge only to disappear due to being overwritten by the constantly shifting objective. We argue that merely using curiosity for fast environment exploration or as a bonus reward for a specific task does not harness the full potential of this technique and misses useful skills. Instead, we propose to shift the focus towards retaining the behaviours which emerge during curiosity-based learning. We posit that these self-discovered behaviours serve as valuable skills in an agent's repertoire to solve related tasks. Our experiments demonstrate the continuous shift in behaviour throughout training and the benefits of a simple policy snapshot method to reuse discovered behaviour for transfer tasks.

show abstract

Section: Related Workmentioning

confidence: 99%

Is Curiosity All You Need? On the Utility of Emergent Behaviours from Curious Exploration

Groth,

Wulfmeier,

Vezzani

et al. 2021

Preprint

View full text Add to dashboard Cite

show abstract

“…This is often done by providing intrinsic motivation via a self-derived reward, resulting in "curiosity-driven" behaviour [8,4]. Such approaches include surprise [1,5] (where experiencing unexpected dynamics is rewarded), and empowerment [40,20] (where the agent prefers states in which it has more control).…”

Section: Related Workmentioning

confidence: 99%

“…On the other hand, we can show that the formulation of the metric learning loss is susceptible to embedding explosion if the representation space is left unconstrained 2 . In our work, we build upon the DBC model in an attempt to tackle both problems: (1) we address embedding explosion by stabilizing the state representation space via a norm constraint and (2) we prevent embedding collapse by altering the encoder training method.…”

Section: Introductionmentioning

confidence: 99%

Towards Robust Bisimulation Metric Learning

Kemertas¹,

Aumentado-Armstrong²

2021

Preprint

View full text Add to dashboard Cite

Learned representations in deep reinforcement learning (DRL) have to extract taskrelevant information from complex observations, balancing between robustness to distraction and informativeness to the policy. Such stable and rich representations, often learned via modern function approximation techniques, can enable practical application of the policy improvement theorem, even in high-dimensional continuous state-action spaces. Bisimulation metrics offer one solution to this representation learning problem, by collapsing functionally similar states together in representation space, which promotes invariance to noise and distractors. In this work, we generalize value function approximation bounds for on-policy bisimulation metrics to non-optimal policies and approximate environment dynamics. Our theoretical results help us identify embedding pathologies that may occur in practical use. In particular, we find that these issues stem from an underconstrained dynamics model and an unstable dependence of the embedding norm on the reward signal in environments with sparse rewards. Further, we propose a set of practical remedies: (i) a norm constraint on the representation space, and (ii) an extension of prior approaches with intrinsic rewards and latent space regularization. Finally, we provide evidence that the resulting method is not only more robust to sparse reward functions, but also able to solve challenging continuous control tasks with observational distractions, where prior methods fail.

show abstract

“…Exploration in RL: Exploration is one of the most important issues in model-free RL, as there is the key assumption that all state-action pairs must be visited infinitely often to guarantee the convergence of Q-function [56]. In order to explore diverse state-action pairs in the joint state-action space, various methods have been considered in prior works: intrinsically-motivated reward based on curiosity [5,11], model prediction error [1,10], information gain [26,28,29], and counting states [33,35]. These exploration techniques improve exploration and performance in challenging sparse-reward environments [3,10,13].…”

Section: Related Workmentioning

confidence: 99%

“…Starting from the left-lower corner (0.5, 0.5), the agent explores the maze without any external reward. First, note that for this pure exploration task, the optimal policy maximizing J M axEnt (π) is given by the uniform policy that selects all actions in 1] uniformly regardless of the value of s t . This is because the uniform distribution has maximum entropy for a bounded space [14].…”

Section: Saturationmentioning

confidence: 99%

A Max-Min Entropy Framework for Reinforcement Learning

Han

Sung

2021

Preprint

View full text Add to dashboard Cite

In this paper, we propose a max-min entropy framework for reinforcement learning (RL) to overcome the limitation of the maximum entropy RL framework in modelfree sample-based learning. Whereas the maximum entropy RL framework guides learning for policies to reach states with high entropy in the future, the proposed max-min entropy framework aims to learn to visit states with low entropy and maximize the entropy of these low-entropy states to promote exploration. For general Markov decision processes (MDPs), an efficient algorithm is constructed under the proposed max-min entropy framework based on disentanglement of exploration and exploitation. Numerical results show that the proposed algorithm yields drastic performance improvement over the current state-of-the-art RL algorithms.Preprint. Under review.

show abstract

Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning

Cited by 60 publications

References 4 publications

Is Curiosity All You Need? On the Utility of Emergent Behaviours from Curious Exploration

Is Curiosity All You Need? On the Utility of Emergent Behaviours from Curious Exploration

Towards Robust Bisimulation Metric Learning

A Max-Min Entropy Framework for Reinforcement Learning

Contact Info

Product

Resources

About