How to compute initially unknown reward values is one of the key problems in reinforcement learning theory, and two basic approaches are used. Model-free algorithms rely on the accumulation of substantial amounts of experience to compute the value of actions, whereas in model-based learning, the agent seeks to learn the generative process for outcomes, from which the value of actions can be predicted. Here we show that (i) "probability matching," a consistently observed example of suboptimal choice behavior in humans, occurs in an optimal Bayesian model-based learner using a max decision rule that is initialized with ecologically plausible, but incorrect, beliefs about the generative process for outcomes, and (ii) human behavior can be strongly and predictably altered by the presence of cues suggestive of various generative processes, despite statistically identical outcome generation. These results suggest that human decision making is rational and model based, and not consistent with model-free learning.

decision making | probability matching | reinforcement learning

Given a limited set of data about the world, what is the best thing to do? This question lies at the heart of all decision making, from simple everyday errands to elaborate and complex scientific experiments. If the reward amount for each possible action is known in advance, it is straightforward to make choices that maximize reward. In the real world, however, reward values are nearly always initially unknown, and computing them is not trivial. Thus, understanding how rewards are learned and computed is one of the key problems in reinforcement learning theory. Computing the optimal policy (i.e., determining the "best thing to do") requires acquiring one of two types of knowledge. In model-free learning, an agent must accumulate a substantial amount of experience regarding the consequences of taking various actions in various states, from which the average value of the states can be learned. In model-based learning, an agent must acquire a "world model," which constitutes beliefs about how the world generates outcomes in response to actions. Although both model-free and model-based reinforcement-learning algorithms have been the subject of much study in computer science and machine learning, model-free algorithms have primarily been used as models of human choice behavior.

Whereas it is clear that our survival depends on the ability to make appropriate decisions from incomplete and ambiguous information, numerous studies in economics, psychology, and neuroscience have consistently found highly suboptimal behavior in seemingly simple decision tasks. Why is this? Consider the sequential binary decision task, which involves a choice between two options, one with a higher probability of success than the other (e.g., 70% vs. 30% of trials). The optimal strategy for this task is to determine which option has the higher probability of success and then choose only that option. Humans, however, tend to sample the alternatives in proportion to the options' respective reward probabilities.
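
The cost of this matching strategy relative to maximizing is easy to quantify. With success probabilities of 0.7 and 0.3, always choosing the better option is rewarded on 70% of trials, whereas choosing each option in proportion to its success rate is rewarded on only 0.7 × 0.7 + 0.3 × 0.3 = 58% of trials. The short simulation below is an illustrative sketch of this arithmetic only (the probabilities and trial count are assumed for illustration; it is not an analysis from this paper).

```python
# Illustrative sketch: expected reward on a stationary two-option task
# (assumed success probabilities 0.7 and 0.3), comparing a maximizing
# policy with a probability-matching policy.
import random

P_HIGH, P_LOW = 0.7, 0.3     # assumed outcome probabilities for the two options
N_TRIALS = 100_000

def run(policy):
    """Simulate N_TRIALS choices; policy() returns the success probability of the chosen option."""
    wins = 0
    for _ in range(N_TRIALS):
        if random.random() < policy():
            wins += 1
    return wins / N_TRIALS

# Maximizing: always choose the better option -> expected reward = 0.70.
maximize = lambda: P_HIGH

# Probability matching: choose each option in proportion to its success rate
# -> expected reward = 0.7 * 0.7 + 0.3 * 0.3 = 0.58.
def match():
    return P_HIGH if random.random() < P_HIGH else P_LOW

print(f"maximizing: {run(maximize):.3f} (analytic 0.700)")
print(f"matching:   {run(match):.3f} (analytic 0.580)")
```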
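
To make the model-free/model-based distinction described above concrete, the following minimal sketch contrasts the two kinds of learner on the same two-option task. The learning rate, the Beta-Bernoulli world model, and the task probabilities are illustrative assumptions; this is not the specific Bayesian learner analyzed in this paper, which concerns what happens when the assumed generative process is incorrect.

```python
# Minimal sketch (illustrative assumptions, not the paper's model): a model-free
# learner that updates action values from experienced rewards, and a model-based
# learner that assumes a static Bernoulli generative process per option and
# tracks Beta posteriors; both choose with a max decision rule.
import random

class ModelFreeLearner:
    """Learns action values directly from rewards with an incremental update."""
    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.q = [0.5, 0.5]              # initial value estimates for the two options

    def choose(self):
        return 0 if self.q[0] >= self.q[1] else 1

    def update(self, action, reward):
        self.q[action] += self.alpha * (reward - self.q[action])

class ModelBasedLearner:
    """Learns a world model: a Beta posterior over each option's success probability."""
    def __init__(self):
        self.successes = [1, 1]          # Beta(1, 1) priors
        self.failures = [1, 1]

    def posterior_mean(self, action):
        return self.successes[action] / (self.successes[action] + self.failures[action])

    def choose(self):
        return 0 if self.posterior_mean(0) >= self.posterior_mean(1) else 1

    def update(self, action, reward):
        if reward:
            self.successes[action] += 1
        else:
            self.failures[action] += 1

def train(learner, probs=(0.7, 0.3), trials=1000):
    """Run the learner on a stationary Bernoulli task and return it."""
    for _ in range(trials):
        a = learner.choose()
        r = 1 if random.random() < probs[a] else 0
        learner.update(a, r)
    return learner

mf = train(ModelFreeLearner())
mb = train(ModelBasedLearner())
print("model-free value estimates: ", [round(v, 2) for v in mf.q])
print("model-based posterior means:", [round(mb.posterior_mean(a), 2) for a in (0, 1)])
```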