2018
DOI: 10.1561/2200000070

A Tutorial on Thompson Sampling

Abstract: Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance. The algorithm addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use. This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, product recommendation, assortment, active learning with neural networks, and reinforcement learning in Markov decision processes.
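For readers who want the mechanics in miniature, below is a sketch of Thompson sampling for the Bernoulli bandit, one of the examples the tutorial covers. The arm probabilities, horizon, and Beta(1, 1) priors are illustrative assumptions, not values taken from the tutorial.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 arms with unknown success probabilities.
true_probs = [0.3, 0.5, 0.7]   # assumed for illustration
n_arms, horizon = len(true_probs), 1000

# Beta(1, 1) priors: alpha counts successes, beta counts failures.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

for t in range(horizon):
    # Sample one plausible success probability per arm from the posterior...
    theta = rng.beta(alpha, beta)
    # ...and act greedily with respect to the sample. This is the exploration
    # mechanism: poorly known arms produce widely dispersed samples.
    arm = int(np.argmax(theta))
    reward = rng.random() < true_probs[arm]
    # Conjugate posterior update for the Bernoulli likelihood.
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))

As data accumulates, the posteriors concentrate and the algorithm shifts automatically from exploring to exploiting, which is the balance the abstract describes.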

Cited by 404 publications (253 citation statements). References 33 publications.
“…In response to the computational intractability of the OFU principle, researchers in RL and online learning have proposed the use of Thompson sampling [49] for exploration. Abeille and Lazaric [2] show that the regret of a Thompson sampling approach for LQR scales as O(T^{2/3}) and improve the result to O(√T) in [3], where O(·) hides poly-logarithmic factors.…”
Section: Related Work
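For context, the regret in these bounds is the standard cumulative shortfall against the optimal action; writing it generically (this is the common convention, not a definition quoted from the cited papers):

\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \Bigl( \max_{a} \mathbb{E}\bigl[r_t(a)\bigr] - \mathbb{E}\bigl[r_t(a_t)\bigr] \Bigr)

An O(√T) bound therefore means the average per-round regret decays as O(1/√T), a meaningful improvement over the slower O(T^{-1/3}) decay implied by the O(T^{2/3}) rate.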
“…Then, the active sampling routine (line 6 in Algorithm 1) can be expanded as: select the user goal i with the maximum sampled p_i value, where the p_i are drawn from a Gaussian distribution N to introduce randomness. The Thompson-Sampling-like (Russo et al. 2018) sub-routine of Algorithm 2 is motivated by two observations: (1) on average, categories with a larger failure rate f_i are preferable, as they inject more difficult cases (containing more useful information to be learned) given the current performance of the agent policy; the generated data (simulated experiences) are generally associated with the steepest learning direction and can prospectively boost training speed. (2) Categories that are estimated less reliably (due to a smaller count n_i) may have a large de facto failure rate, and are thus worth allocating more training instances to reduce the uncertainty.…”
Section: Active Planning Based On World Model
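The excerpt describes scores sampled from Gaussians centered on per-category failure rates, with spread shrinking as the observation count grows. A minimal sketch under those assumptions; the 1/sqrt(n_i) variance form and the scale constant are illustrative choices, not taken from the cited paper:

import numpy as np

rng = np.random.default_rng(1)

def select_category(failures, counts, scale=1.0):
    # Empirical failure rate f_i per category, guarding against zero counts.
    f = failures / np.maximum(counts, 1)
    # Uncertainty that shrinks as n_i grows (assumed 1/sqrt(n_i) form).
    sigma = scale / np.sqrt(np.maximum(counts, 1))
    # Randomized scores p_i ~ N(f_i, sigma_i); pick the max, so categories
    # that are either hard or under-explored tend to win.
    p = rng.normal(f, sigma)
    return int(np.argmax(p))

# Hypothetical counts: category 1 fails often; category 0 is barely explored.
failures = np.array([1.0, 40.0, 10.0])
counts = np.array([2.0, 80.0, 100.0])
print(select_category(failures, counts))

This mirrors the two observations in the excerpt: a high f_i raises a category's expected score, while a small n_i widens its sampling noise and lets it win occasionally despite a low estimate.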
“…Following the literature on Thompson sampling, we consider a multivariate Gaussian distribution, since the posterior has a simple closed form, thereby admitting a tractable theoretical analysis. When implementing such an algorithm in practice, more complex distributions can be considered (e.g., see the discussion in Russo et al. 2018).…”
Section: Meta-learning Formulation
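The closed form the excerpt relies on is the standard Gaussian-Gaussian conjugate update. A sketch of Thompson sampling for a linear bandit with a multivariate Gaussian prior over the parameter vector; the dimensions, noise level, and action set are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(2)

d, noise_var = 3, 0.25
theta_true = rng.normal(size=d)      # unknown parameter, used only to simulate rewards
actions = rng.normal(size=(10, d))   # a fixed menu of action feature vectors (assumed)

# Multivariate Gaussian prior N(mu, Sigma); with Gaussian reward noise the
# posterior remains Gaussian, which is the tractable closed form in question.
mu, Sigma = np.zeros(d), np.eye(d)

for t in range(200):
    theta = rng.multivariate_normal(mu, Sigma)   # posterior sample
    a = actions[np.argmax(actions @ theta)]      # act greedily on the sample
    r = a @ theta_true + rng.normal(scale=noise_var ** 0.5)
    # Conjugate Bayesian linear-regression update in precision form.
    prec_old = np.linalg.inv(Sigma)
    Sigma = np.linalg.inv(prec_old + np.outer(a, a) / noise_var)
    mu = Sigma @ (prec_old @ mu + a * r / noise_var)

print("posterior mean:", mu, "truth:", theta_true)

The update is exact here; with non-conjugate likelihoods the posterior loses this closed form, which is why the excerpt defers more complex distributions to practical implementations.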
“…This prior captures shared structure of the kind we described above; e.g., the mean of the prior on the student-specific price-elasticity coefficient may be positive with a small standard deviation. It is well known that choosing a good (bad) prior significantly improves (hurts) the empirical performance of the algorithm (Chapelle and Li 2011, Honda and Takemura 2014, Liu and Li 2015, Russo et al. 2018). However, the prior is typically unknown in practice, particularly when the decision-maker faces a cold start.…”
Section: Introduction
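The prior sensitivity this excerpt notes is easy to demonstrate with the Bernoulli-bandit sampler sketched earlier: run it once from a flat prior and once from a confidently misspecified one and compare cumulative regret. All numbers below are illustrative assumptions:

import numpy as np

def run_ts(alpha0, beta0, true_probs, horizon=2000, seed=0):
    # Cumulative regret of Beta-Bernoulli Thompson sampling from a given prior.
    rng = np.random.default_rng(seed)
    a = np.array(alpha0, dtype=float)
    b = np.array(beta0, dtype=float)
    best, regret = max(true_probs), 0.0
    for _ in range(horizon):
        arm = int(np.argmax(rng.beta(a, b)))
        reward = rng.random() < true_probs[arm]
        a[arm] += reward
        b[arm] += 1 - reward
        regret += best - true_probs[arm]
    return regret

probs = [0.3, 0.5, 0.7]
print("flat prior:", run_ts([1, 1, 1], [1, 1, 1], probs))
# A confidently wrong prior (sure the worst arm is best) must be unlearned
# one observation at a time, inflating regret.
print("bad prior: ", run_ts([50, 1, 1], [1, 50, 50], probs))

The misspecified run incurs noticeably higher regret, which is exactly the cold-start difficulty the excerpt raises: when no good prior is available, the algorithm pays for its initial confidence.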