“…Approximations to the optimal Bayesian policy exist, one of the most successful being Thompson sampling, also known as probability matching (Thompson, 1933; Strens, 2000). In Thompson sampling the agent samples from the posterior over value functions and acts greedily with respect to that sample (Osband et al., 2013, 2014; Lipton et al., 2016; Osband and Van Roy, 2017a); under certain assumptions this strategy enjoys both Bayesian and frequentist regret bounds (Agrawal and Goyal, 2017). In practice, maintaining a posterior over value functions is intractable, so the agent instead maintains a posterior over MDPs: at each episode it samples an MDP from this posterior, solves for the value function of that sample, and acts greedily with respect to that value function.…”
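The episode-level procedure described above can be illustrated with a minimal sketch in Python. This is not the authors' implementation; it assumes a tabular MDP with a Dirichlet posterior over transition probabilities and known mean rewards, and all function and variable names (`sample_mdp`, `solve_value_iteration`, `transition_counts`, `reward_means`) are illustrative.

```python
import numpy as np

def sample_mdp(transition_counts, reward_means):
    """Sample one MDP from a Dirichlet posterior over transitions.

    transition_counts: (S, A, S) array of Dirichlet pseudo-counts (the posterior).
    reward_means: (S, A) array of mean rewards (assumed known for simplicity).
    """
    S, A, _ = transition_counts.shape
    P = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = np.random.dirichlet(transition_counts[s, a])
    return P, reward_means

def solve_value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Solve the sampled MDP by value iteration; return the greedy policy."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * (P @ V)        # Q-values under the sampled MDP, shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return Q.argmax(axis=1)            # policy that is greedy w.r.t. the sampled MDP

# Per-episode loop (posterior-sampling RL, schematically):
#   1) sample an MDP from the posterior over MDPs,
#   2) solve for its value function,
#   3) act greedily with that policy for the episode,
#   4) update transition_counts with the observed transitions.
```

The key design point, as in the text, is that the posterior is maintained over MDP parameters rather than over value functions directly; randomness in the sampled MDP is what drives exploration, since each episode's greedy policy optimizes a different plausible model.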