2017
DOI: 10.48550/arxiv.1707.02038
Preprint
A Tutorial on Thompson Sampling

Cited by 70 publications (91 citation statements)
References 36 publications
“…WSLTS performs Thompson Sampling using the reshaped posterior, excluding the previously selected arm. That is, WSLTS follows standard Thompson Sampling for Bernoulli bandits, as described in [15], with two important differences: First, as opposed to sampling from the posterior over all arms, the sampled reward probability of the previously selected arm is set to zero. This is to ensure that WSLTS follows the core semantics of WSLS as a strict generalization that allows for more sophisticated exploration/exploitation mechanisms.…”
Section: Win-Stay Lose-Thompson-Sample (WSLTS)
confidence: 99%
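The lose-branch of WSLTS quoted above can be sketched for Bernoulli bandits. The function name, the Beta(1,1) priors, and the count-based posterior representation below are illustrative assumptions, not the cited paper's implementation; the only step taken from the excerpt is forcing the previously selected arm's sampled reward probability to zero:

```python
import random

def wslts_select(successes, failures, prev_arm):
    """One lose-branch step of WSLTS (sketch).

    successes/failures: per-arm Bernoulli outcome counts, which under
    assumed Beta(1, 1) priors give Beta(1 + s, 1 + f) posteriors.
    prev_arm: index of the previously selected arm.
    """
    samples = []
    for k in range(len(successes)):
        if k == prev_arm:
            # Core WSLTS modification from the excerpt: the previous
            # arm's sampled reward probability is set to zero, so it
            # cannot be selected again on a "lose" step.
            samples.append(0.0)
        else:
            samples.append(random.betavariate(1 + successes[k],
                                              1 + failures[k]))
    # Standard Thompson Sampling choice: arm with the largest sample.
    return max(range(len(samples)), key=samples.__getitem__)
```

Because the previous arm's sample is exactly zero while every other arm's Beta draw is positive almost surely, the previous arm is excluded, matching the "strict generalization of WSLS" semantics described in the excerpt.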
“…Independent beta-distributed priors with parameters α k = 1 and β k = 1 (corresponding to a uniform distribution) over the estimation of each p k are assumed. At each iteration of TS, a sample is drawn from the posterior distribution of p k for each arm, and the arm with the largest sampled value is selected (Chapelle and Li, 2011;Russo et al, 2017). Choosing actions with TS balances exploration and exploitation in the long run, sampling from arms with the goal of converging on an optimal arm asymptotically (Agrawal and Goyal, 2012).…”
Section: Beta-Bernoulli Thompson Sampling
confidence: 99%
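The Beta-Bernoulli procedure quoted above (uniform Beta(1,1) priors, sample each arm's posterior, select the argmax) can be sketched as follows; the function names and the simulated-environment loop are illustrative assumptions:

```python
import random

def thompson_step(alpha, beta):
    """Draw one sample from each arm's Beta posterior and
    select the arm with the largest sampled value."""
    samples = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
    return max(range(len(samples)), key=samples.__getitem__)

def run_bandit(true_p, horizon, seed=0):
    """Simulate Beta-Bernoulli Thompson Sampling.

    true_p: hidden per-arm success probabilities (for simulation only).
    Returns the number of times each arm was pulled.
    """
    random.seed(seed)
    K = len(true_p)
    alpha = [1.0] * K  # alpha_k = 1: uniform Beta(1, 1) prior
    beta = [1.0] * K   # beta_k = 1
    pulls = [0] * K
    for _ in range(horizon):
        k = thompson_step(alpha, beta)
        reward = 1 if random.random() < true_p[k] else 0
        # Conjugate update: success increments alpha, failure beta.
        alpha[k] += reward
        beta[k] += 1 - reward
        pulls[k] += 1
    return pulls
```

Over a long horizon the pull counts concentrate on the best arm, illustrating the asymptotic convergence the excerpt attributes to Agrawal and Goyal (2012).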
“…To balance exploitation with exploration, we use a sampling algorithm to produce plausible estimates of the probability of a click and the probability of a "yes" survey. We compared both Thompson sampling [16,20] and EwS [13], and qualitatively we found that results in our application looked better with Thompson sampling so we focus on it here. However, EwS is reasonable to use as well.…”
Section: Learning With Discrete Context
confidence: 99%
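A minimal sketch of the sampling idea in the excerpt above: draw plausible estimates of both the click probability and the "yes"-survey probability from independent Beta posteriors. Everything here is an assumption for illustration — the Beta(1,1) priors, the independence of the two outcomes, and especially the multiplicative combination of the two sampled probabilities, which may differ from the cited paper's actual objective:

```python
import random

def sample_plausible_estimates(clicks, shows, yes, surveys):
    """Per action, draw one plausible (p_click, p_yes) pair from
    independent Beta posteriors under assumed Beta(1, 1) priors."""
    estimates = []
    for c, s, y, n in zip(clicks, shows, yes, surveys):
        p_click = random.betavariate(1 + c, 1 + s - c)
        p_yes = random.betavariate(1 + y, 1 + n - y)
        estimates.append((p_click, p_yes))
    return estimates

def select_action(clicks, shows, yes, surveys):
    """Thompson-style choice over sampled estimates.

    Hypothetical scoring: the two sampled probabilities are combined
    multiplicatively; the paper's real objective may weight them
    differently.
    """
    est = sample_plausible_estimates(clicks, shows, yes, surveys)
    return max(range(len(est)), key=lambda k: est[k][0] * est[k][1])
```

Because the estimates are posterior samples rather than point estimates, actions with uncertain click or survey rates are occasionally selected, giving the exploration/exploitation balance the excerpt describes.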