2014 · Preprint
DOI: 10.48550/arxiv.1402.0635

Generalization and Exploration via Randomized Value Functions

Cited by 50 publications (53 citation statements) · References 14 publications
“…Together, this yields a practical Bayesian algorithm that attains a Bayesian regret upper bounded by Õ(L^{3/2}√(SAT)), where L is the time horizon, S is the number of states, A is the number of actions per state, and T is the total number of elapsed time-steps (Õ ignores logarithmic factors). This matches the bounds for several other Bayesian methods in the literature, see e.g., (Osband et al., 2014). Our regret bound is within a factor of L of the known minimax lower bound of Ω(√(LSAT)) (although, strictly speaking, these bounds are not comparable due to the assumptions we make in order to derive our bound).…”
Section: Introduction and Related Work (supporting)
confidence: 85%
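For reference, the quantity these bounds refer to is the episodic Bayesian regret; a minimal statement in standard notation is given below (the symbols K for the number of episodes, s_{k,1} for the initial state of episode k, and π_k for the policy played in episode k are assumptions chosen here for illustration, not taken from the excerpt):

\mathrm{BayesRegret}(T) = \mathbb{E}\!\left[\sum_{k=1}^{K} \Big( V^{*}_{1}(s_{k,1}) - V^{\pi_k}_{1}(s_{k,1}) \Big)\right], \qquad T = K L,

\mathrm{BayesRegret}(T) = \tilde{O}\!\left(L^{3/2}\sqrt{S A T}\right), \qquad \text{minimax lower bound: } \Omega\!\left(\sqrt{L S A T}\right).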
“…Approximations to the optimal Bayesian policy exist, one of the most successful being Thompson sampling, also known as probability matching (Strens, 2000; Thompson, 1933). In Thompson sampling the agent samples from the posterior over value functions and acts greedily with respect to that sample (Osband et al., 2013, 2014; Lipton et al., 2016; Osband and Van Roy, 2017a), and it can be shown that this strategy yields both Bayesian and frequentist regret bounds under certain assumptions (Agrawal and Goyal, 2017). In practice, maintaining a posterior over value functions is intractable, and so instead the agent maintains the posterior over MDPs, and at each episode an MDP is sampled from this posterior, the value function for that sample is solved for, and the policy is greedy with respect to that value function.…”
Section: Introduction and Related Work (mentioning)
confidence: 99%
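The sample-then-plan loop described in the excerpt above can be made concrete with a short sketch: one episode of tabular PSRL, assuming Dirichlet transition posteriors and a simple Gaussian-style perturbation of the mean rewards. Every name and signature below is illustrative, not taken from the cited papers.

import numpy as np

def psrl_episode(counts, reward_sum, reward_cnt, horizon, env_reset, env_step, rng):
    # counts: (S, A, S) visit counts; reward_sum, reward_cnt: (S, A) running reward statistics.
    n_states, n_actions, _ = counts.shape

    # 1) Sample one MDP from the posterior (Dirichlet transitions, noisy mean rewards).
    P = np.array([[rng.dirichlet(counts[s, a] + 1.0) for a in range(n_actions)]
                  for s in range(n_states)])                       # shape (S, A, S)
    R = reward_sum / np.maximum(reward_cnt, 1.0) \
        + rng.normal(scale=1.0 / np.sqrt(np.maximum(reward_cnt, 1.0)))  # shape (S, A)

    # 2) Solve the sampled MDP by finite-horizon dynamic programming.
    Q = np.zeros((horizon + 1, n_states, n_actions))
    for h in range(horizon - 1, -1, -1):
        Q[h] = R + P @ Q[h + 1].max(axis=1)

    # 3) Act greedily w.r.t. the sampled value function and update the posterior statistics.
    s = env_reset()
    for h in range(horizon):
        a = int(Q[h, s].argmax())
        s_next, r = env_step(s, a)
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
        reward_cnt[s, a] += 1
        s = s_next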
“…We conclude by addressing a potential criticism of proposition 1, i.e. that the described issues may be circumvented by initialising expected Q values to a value higher than the maximal attainable Q value in a given MDP, an approach known as optimistic initialisation (Osband et al., 2014). In such a case, symmetries in the Q function may break as updates are performed and move towards more realistic Q values.…”
Section: Randomised Policy Iteration and Propagation of Uncertainty (mentioning)
confidence: 98%
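A minimal sketch of the optimistic-initialisation idea the excerpt refers to, assuming a tabular Q-learning setting with rewards bounded by r_max; the helper names are hypothetical.

import numpy as np

def optimistic_q_table(n_states, n_actions, horizon, r_max):
    # Start every Q value above any return attainable in `horizon` steps, so that
    # unvisited actions look better than anything observed so far.
    return np.full((n_states, n_actions), horizon * r_max, dtype=float)

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Standard Q-learning step: only visited state-action pairs decay towards
    # realistic values, which is the symmetry-breaking effect discussed above.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])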
“…Many recent model-free methods of exploration in reinforcement learning can be interpreted as attempts to scale the PSRL algorithm beyond tabular settings, and combine it with neural network function approximation (Osband et al., 2014; Moerland et al., 2017; O'Donoghue et al., 2018; Azizzadenesheli et al., 2018). To scale beyond tabular settings, these methods depart from PSRL by directly modelling a distribution over Q functions, P_Q, instead of a distribution over MDPs, P_T, an approach known as randomised value functions (Osband et al., 2017).…”
Section: Randomised Policy Iteration and Propagation of Uncertainty (mentioning)
confidence: 99%
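The distinction drawn in the excerpt, modelling a distribution over Q functions (P_Q) directly rather than a distribution over MDPs (P_T), can be sketched with a small bootstrapped ensemble; this is one common way to approximate P_Q, and the class and method names below are assumptions for illustration.

import numpy as np

class RandomisedValueFunctions:
    # Each ensemble member stands in for one plausible Q function drawn from P_Q.
    def __init__(self, n_members, n_states, n_actions, prior_scale=1.0, rng=None):
        self.rng = rng or np.random.default_rng()
        # Random initial values act as crude priors that keep the members diverse.
        self.Q = self.rng.normal(scale=prior_scale, size=(n_members, n_states, n_actions))

    def sample_member(self):
        # At the start of each episode, draw one Q function from the approximate posterior.
        return int(self.rng.integers(len(self.Q)))

    def act(self, member, s):
        # Act greedily with respect to the sampled Q function for the whole episode.
        return int(self.Q[member, s].argmax())

    def update(self, s, a, r, s_next, alpha=0.1, gamma=0.99):
        # Bootstrap mask: each transition updates a random subset of members,
        # so the ensemble spread approximates uncertainty over Q.
        mask = self.rng.random(len(self.Q)) < 0.5
        for k in np.flatnonzero(mask):
            target = r + gamma * self.Q[k, s_next].max()
            self.Q[k, s, a] += alpha * (target - self.Q[k, s, a])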
“…One possible solution is policy parameter perturbation on a large time scale. Though previous attempts were restricted to linear function approximators (Rückstieß et al., 2008; Osband et al., 2014), progress has been made with neural networks, through either network section duplication (Osband et al., 2016) or adaptive-scale parameter noise injection (Fortunato et al., 2017). However, in Osband et al. (2016) the episode-wise stochasticity is unadjustable, and the duplicated modules do not cooperate with each other.…”
Section: Introduction (mentioning)
confidence: 99%
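A rough sketch of the per-episode (large time scale) parameter perturbation with an adaptive noise scale mentioned in the excerpt above, in the spirit of the cited approaches; the distance heuristic and all names are illustrative assumptions, not the cited papers' exact procedures.

import numpy as np

def perturb_parameters(theta, sigma, rng):
    # Perturb the policy parameters once per episode (slow time scale), not per action.
    return theta + rng.normal(scale=sigma, size=theta.shape)

def adapt_sigma(sigma, distance, target_distance, factor=1.01):
    # Grow the noise if the perturbed and unperturbed policies behave too similarly,
    # shrink it if they diverge too much (heuristic adaptive scaling).
    return sigma * factor if distance < target_distance else sigma / factor

# Illustrative usage with a linear policy table theta of shape (S, A).
rng = np.random.default_rng(0)
theta = np.zeros((10, 4))
sigma = 0.1
for episode in range(100):
    theta_noisy = perturb_parameters(theta, sigma, rng)
    # ... run one episode acting with argmax(theta_noisy[s]) and update theta ...
    distance = np.abs(theta_noisy - theta).mean()   # stand-in for a policy-distance measure
    sigma = adapt_sigma(sigma, distance, target_distance=0.05)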