2018
DOI: 10.1561/2200000070

A Tutorial on Thompson Sampling

Abstract: Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance. The algorithm addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use. This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, product recommendation, assortment, active learning with neural networks, and reinforcement learning in Markov decision processes.
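For readers who want the mechanics in miniature, below is a sketch of Thompson sampling for the Bernoulli bandit, one of the examples the tutorial covers. The arm probabilities, horizon, and Beta(1, 1) priors are illustrative assumptions, not values taken from the tutorial.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 arms with unknown success probabilities.
true_probs = [0.3, 0.5, 0.7]   # assumed for illustration
n_arms, horizon = len(true_probs), 1000

# Beta(1, 1) priors: alpha counts successes, beta counts failures.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

for t in range(horizon):
    # Sample one plausible success probability per arm from the posterior...
    theta = rng.beta(alpha, beta)
    # ...and act greedily with respect to the sample. This is the exploration
    # mechanism: poorly known arms produce widely dispersed samples.
    arm = int(np.argmax(theta))
    reward = rng.random() < true_probs[arm]
    # Conjugate posterior update for the Bernoulli likelihood.
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))

As data accumulates, the posteriors concentrate and the algorithm shifts automatically from exploring to exploiting, which is the balance the abstract describes.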

Cited by 404 publications (253 citation statements). References 33 publications.
“…In response to the computational intractability of the OFU principle, researchers in RL and online learning have proposed the use of Thompson sampling [49] for exploration. Abeille and Lazaric [2] show that the regret of a Thompson sampling approach for LQR scales as O(T^{2/3}) and improve the result to O(√T) in [3], where O(·) hides poly-logarithmic factors.…”
Section: Related Work
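For context, the regret in these bounds is the standard cumulative shortfall against the optimal action; writing it generically (this is the common convention, not a definition quoted from the cited papers):

\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \Bigl( \max_{a} \mathbb{E}\bigl[r_t(a)\bigr] - \mathbb{E}\bigl[r_t(a_t)\bigr] \Bigr)

An O(√T) bound therefore means the average per-round regret decays as O(1/√T), a meaningful improvement over the slower O(T^{-1/3}) decay implied by the O(T^{2/3}) rate.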
“…Then, the active sampling routine (line 6 in Algorithm 1) can be expanded as: select the user goal i with the maximum sampled p_i value, where the p_i are drawn from a Gaussian distribution N to introduce randomness. The Thompson-Sampling-like (Russo et al. 2018) sub-routine of Algorithm 2 is motivated by two observations: (1) on average, categories with a larger failure rate f_i are preferable, as they inject more difficult cases (containing more useful information to be learned) given the current performance of the agent policy; the generated data (simulated experiences) are generally associated with the steepest learning direction and can prospectively boost training speed. (2) Categories that are estimated less reliably (due to a smaller count n_i) may have a large de facto failure rate, and are thus worth allocating more training instances to reduce the uncertainty.…”
Section: Active Planning Based On World Model
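The excerpt describes scores sampled from Gaussians centered on per-category failure rates, with spread shrinking as the observation count grows. A minimal sketch under those assumptions; the 1/sqrt(n_i) variance form and the scale constant are illustrative choices, not taken from the cited paper:

import numpy as np

rng = np.random.default_rng(1)

def select_category(failures, counts, scale=1.0):
    # Empirical failure rate f_i per category, guarding against zero counts.
    f = failures / np.maximum(counts, 1)
    # Uncertainty that shrinks as n_i grows (assumed 1/sqrt(n_i) form).
    sigma = scale / np.sqrt(np.maximum(counts, 1))
    # Randomized scores p_i ~ N(f_i, sigma_i); pick the max, so categories
    # that are either hard or under-explored tend to win.
    p = rng.normal(f, sigma)
    return int(np.argmax(p))

# Hypothetical counts: category 1 fails often; category 0 is barely explored.
failures = np.array([1.0, 40.0, 10.0])
counts = np.array([2.0, 80.0, 100.0])
print(select_category(failures, counts))

This mirrors the two observations in the excerpt: a high f_i raises a category's expected score, while a small n_i widens its sampling noise and lets it win occasionally despite a low estimate.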
“…Following the literature on Thompson sampling, we consider a multivariate Gaussian distribution, since the posterior has a simple closed form, thereby admitting a tractable theoretical analysis. When implementing such an algorithm in practice, more complex distributions can be considered (e.g., see the discussion in Russo et al. 2018).…”
Section: Meta-learning Formulation
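The closed form the excerpt relies on is the standard Gaussian-Gaussian conjugate update. A sketch of Thompson sampling for a linear bandit with a multivariate Gaussian prior over the parameter vector; the dimensions, noise level, and action set are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(2)

d, noise_var = 3, 0.25
theta_true = rng.normal(size=d)      # unknown parameter, used only to simulate rewards
actions = rng.normal(size=(10, d))   # a fixed menu of action feature vectors (assumed)

# Multivariate Gaussian prior N(mu, Sigma); with Gaussian reward noise the
# posterior remains Gaussian, which is the tractable closed form in question.
mu, Sigma = np.zeros(d), np.eye(d)

for t in range(200):
    theta = rng.multivariate_normal(mu, Sigma)   # posterior sample
    a = actions[np.argmax(actions @ theta)]      # act greedily on the sample
    r = a @ theta_true + rng.normal(scale=noise_var ** 0.5)
    # Conjugate Bayesian linear-regression update in precision form.
    prec_old = np.linalg.inv(Sigma)
    Sigma = np.linalg.inv(prec_old + np.outer(a, a) / noise_var)
    mu = Sigma @ (prec_old @ mu + a * r / noise_var)

print("posterior mean:", mu, "truth:", theta_true)

The update is exact here; with non-conjugate likelihoods the posterior loses this closed form, which is why the excerpt defers more complex distributions to practical implementations.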
“…This prior captures shared structure of the kind we described above; e.g., the mean of the prior on the student-specific price-elasticity coefficient may be positive with a small standard deviation. It is well known that choosing a good (bad) prior significantly improves (hurts) the empirical performance of the algorithm (Chapelle and Li 2011, Honda and Takemura 2014, Liu and Li 2015, Russo et al. 2018). However, the prior is typically unknown in practice, particularly when the decision-maker faces a cold start.…”
Section: Introduction
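The prior sensitivity this excerpt notes is easy to demonstrate with the Bernoulli-bandit sampler sketched earlier: run it once from a flat prior and once from a confidently misspecified one and compare cumulative regret. All numbers below are illustrative assumptions:

import numpy as np

def run_ts(alpha0, beta0, true_probs, horizon=2000, seed=0):
    # Cumulative regret of Beta-Bernoulli Thompson sampling from a given prior.
    rng = np.random.default_rng(seed)
    a = np.array(alpha0, dtype=float)
    b = np.array(beta0, dtype=float)
    best, regret = max(true_probs), 0.0
    for _ in range(horizon):
        arm = int(np.argmax(rng.beta(a, b)))
        reward = rng.random() < true_probs[arm]
        a[arm] += reward
        b[arm] += 1 - reward
        regret += best - true_probs[arm]
    return regret

probs = [0.3, 0.5, 0.7]
print("flat prior:", run_ts([1, 1, 1], [1, 1, 1], probs))
# A confidently wrong prior (sure the worst arm is best) must be unlearned
# one observation at a time, inflating regret.
print("bad prior: ", run_ts([50, 1, 1], [1, 50, 50], probs))

The misspecified run incurs noticeably higher regret, which is exactly the cold-start difficulty the excerpt raises: when no good prior is available, the algorithm pays for its initial confidence.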