“…Approximations to the optimal Bayesian policy exist, one of the most successful being Thompson sampling, also known as probability matching (Thompson, 1933; Strens, 2000). In Thompson sampling the agent samples from the posterior over value functions and acts greedily with respect to that sample (Osband et al., 2013, 2014; Lipton et al., 2016; Osband and Van Roy, 2017a); under certain assumptions this strategy enjoys both Bayesian and frequentist regret bounds (Agrawal and Goyal, 2017). In practice, maintaining a posterior over value functions is intractable, so the agent instead maintains a posterior over MDPs: at each episode it samples an MDP from this posterior, solves for the value function of that sample, and acts greedily with respect to that value function.…”
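The episode-level procedure described above can be illustrated with a minimal sketch in Python. This is not the authors' implementation; it assumes a tabular MDP with a Dirichlet posterior over transition probabilities and known mean rewards, and all function and variable names (`sample_mdp`, `solve_value_iteration`, `transition_counts`, `reward_means`) are illustrative.

```python
import numpy as np

def sample_mdp(transition_counts, reward_means):
    """Sample one MDP from a Dirichlet posterior over transitions.

    transition_counts: (S, A, S) array of Dirichlet pseudo-counts (the posterior).
    reward_means: (S, A) array of mean rewards (assumed known for simplicity).
    """
    S, A, _ = transition_counts.shape
    P = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = np.random.dirichlet(transition_counts[s, a])
    return P, reward_means

def solve_value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Solve the sampled MDP by value iteration; return the greedy policy."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * (P @ V)        # Q-values under the sampled MDP, shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return Q.argmax(axis=1)            # policy that is greedy w.r.t. the sampled MDP

# Per-episode loop (posterior-sampling RL, schematically):
#   1) sample an MDP from the posterior over MDPs,
#   2) solve for its value function,
#   3) act greedily with that policy for the episode,
#   4) update transition_counts with the observed transitions.
```

The key design point, as in the text, is that the posterior is maintained over MDP parameters rather than over value functions directly; randomness in the sampled MDP is what drives exploration, since each episode's greedy policy optimizes a different plausible model.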