Since its introduction, the reward prediction error (RPE) theory of dopamine has explained a wealth of empirical phenomena, providing a unifying framework for understanding the representation of reward and value in the brain [1-3]. According to the now canonical theory, reward predictions are represented as a single scalar quantity, which supports learning about the expectation, or mean, of stochastic outcomes. In the present work, we propose a novel account of dopamine-based reinforcement learning. Inspired by recent artificial intelligence research on distributional reinforcement learning [4-6], we hypothesized that the brain represents possible future rewards not as a single mean, but instead as a probability distribution, effectively representing multiple future outcomes simultaneously and in parallel. This idea leads immediately to a set of empirical predictions, which we tested using single-unit recordings from mouse ventral tegmental area. Our findings provide strong evidence for a neural realization of distributional reinforcement learning.

The RPE theory of dopamine derives from work in the artificial intelligence (AI) field of reinforcement learning (RL) [7]. Since the link to neuroscience was first made, however, RL has made substantial advances [8,9], revealing factors that radically enhance the effectiveness of RL algorithms [10]. In some cases, the relevant mechanisms invite comparison with neural function, suggesting new hypotheses concerning reward-based learning in the brain [11-13]. Here, we examine one particularly promising recent development in AI research and
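To make the hypothesized representation concrete, the sketch below is a minimal illustration of one distributional-RL-style update: a population of value predictors, each applying a different asymmetry between positive and negative prediction errors, converges to a spread of values (expectiles) that jointly describe a reward distribution rather than only its mean. This is not code from the study; the reward distribution, learning rate, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# A bimodal reward distribution standing in for a stochastic outcome (assumed).
def sample_reward():
    return rng.normal(1.0, 0.5) if rng.random() < 0.3 else rng.normal(5.0, 0.5)

# Asymmetry tau_i: the fraction of the learning rate applied to positive errors.
taus = np.linspace(0.05, 0.95, 10)
values = np.zeros_like(taus)          # one value predictor per "unit"
lr = 0.01

for _ in range(50_000):
    r = sample_reward()
    errors = r - values               # one prediction error per predictor
    # Scale positive and negative errors asymmetrically: predictors with large
    # tau are pulled toward the upper tail, those with small tau toward the lower.
    values += lr * np.where(errors > 0, taus, 1 - taus) * errors

print(np.round(values, 2))  # a spread of expectiles spanning the distribution
```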
The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret. The proof is accompanied by a numerical comparison with other optimal policies, experiments that had so far been lacking in the literature for the Bernoulli case.
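For reference, the following is a minimal sketch of Thompson Sampling for Bernoulli rewards with independent Beta(1, 1) priors, the setting analyzed above. The bandit instance, horizon, and function name are illustrative assumptions, and the quantity tracked is the cumulative pseudo-regret.

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Thompson Sampling on a Bernoulli bandit with Beta(1, 1) priors.

    Returns the cumulative pseudo-regret after `horizon` pulls.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    successes = np.zeros(n_arms)   # posterior alpha - 1 per arm
    failures = np.zeros(n_arms)    # posterior beta - 1 per arm
    best_mean = max(true_means)
    regret = 0.0

    for _ in range(horizon):
        # Draw one mean estimate per arm from its Beta posterior, play the argmax.
        samples = rng.beta(successes + 1, failures + 1)
        arm = int(np.argmax(samples))
        reward = rng.random() < true_means[arm]
        successes[arm] += reward
        failures[arm] += 1 - reward
        regret += best_mean - true_means[arm]

    return regret

# Example: two-armed Bernoulli bandit (means chosen for illustration).
print(thompson_sampling_bernoulli([0.5, 0.6], horizon=10_000))
```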
We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins [J. R. Stat. Soc. Ser. B Stat. Methodol. 41 (1979) 148-177], based on upper confidence bounds of the arm payoffs computed using the Kullback-Leibler divergence. We consider two classes of distributions for which instances of this general idea are analyzed: the kl-UCB algorithm is designed for one-parameter exponential families and the empirical KL-UCB algorithm for bounded and finitely supported distributions. Our main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins [Adv. in Appl. Math. 6 (1985) 4-22] and Burnetas and Katehakis [Adv. in Appl. Math. 17 (1996) 122-142], respectively. We also investigate the behavior of these algorithms when used with general bounded rewards, showing in particular that they provide significant improvements over the state-of-the-art.
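A minimal sketch of the Bernoulli instance of this idea follows: for each arm, the KL-based upper confidence bound is the largest mean q above the empirical mean whose Kullback-Leibler divergence from that empirical mean, scaled by the number of pulls, stays within a logarithmic exploration budget; it can be found by bisection because the Bernoulli KL divergence is increasing in q on that interval. The exploration budget (log t, with the optional log log t term set to zero), the tolerance, and the variable names are assumptions for illustration, not the exact constants analyzed in the paper.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, c=0.0, tol=1e-6):
    """Largest q >= mean with pulls * kl(mean, q) <= log(t) + c*log(log(t)),
    found by bisection on [mean, 1]."""
    budget = np.log(t) + c * np.log(max(np.log(t), 1.0))
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pulls * kl_bernoulli(mean, mid) > budget:
            hi = mid
        else:
            lo = mid
    return lo

def kl_ucb(true_means, horizon, seed=0):
    """Play a Bernoulli bandit with the kl-UCB-style index above."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    pulls = np.zeros(n_arms, dtype=int)
    rewards = np.zeros(n_arms)

    for t in range(1, horizon + 1):
        if t <= n_arms:                # pull each arm once to initialize
            arm = t - 1
        else:
            indices = [kl_ucb_index(rewards[a] / pulls[a], pulls[a], t)
                       for a in range(n_arms)]
            arm = int(np.argmax(indices))
        pulls[arm] += 1
        rewards[arm] += rng.random() < true_means[arm]

    return pulls  # higher-mean arms should dominate the pull counts

print(kl_ucb([0.1, 0.2, 0.3], horizon=5_000))
```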