Summary
Central to the organization of behavior is the ability to predict the values of outcomes to guide choices. The accuracy of such predictions is honed by a teaching signal that indicates how incorrect a prediction was ('reward prediction error', RPE). In several reinforcement learning contexts such as Pavlovian conditioning and decisions guided by reward history, this RPE signal is provided by midbrain dopamine neurons. In many situations, however, the stimuli predictive of outcomes are perceptually ambiguous. Perceptual uncertainty is known to influence choices, but it has been unclear whether or how dopamine neurons factor it into their teaching signal. To cope with uncertainty, we extended a reinforcement learning model with a belief state about the perceptually ambiguous stimulus; this model generates an estimate of the probability of choice correctness, termed decision confidence. We show that dopamine responses in monkeys performing a perceptually ambiguous decision task comply with the model's predictions. Consequently, dopamine responses did not simply reflect a stimulus' average expected reward value, but were predictive of the trial-to-trial fluctuations in perceptual accuracy. These confidence-dependent dopamine responses emerged prior to monkeys' choice initiation, raising the possibility that dopamine impacts impending decisions, in addition to encoding a post-decision teaching signal. Finally, by manipulating reward size, we found that dopamine neurons reflect both the upcoming reward size and the confidence in achieving it. Together, our results show that dopamine responses convey teaching signals that are also appropriate for perceptual decisions.
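To make the belief-state logic concrete, here is a minimal single-trial sketch in Python (our own illustrative code, not the authors' implementation; names such as belief_state_trial, signal, and noise_sd are assumptions): a noisy percept yields a posterior over the two stimulus alternatives, confidence is the posterior probability that the chosen side is correct, and the RPE is computed against a confidence-scaled value prediction.

```python
import numpy as np

def belief_state_trial(true_side, signal=1.0, noise_sd=1.5, reward=1.0,
                       rng=None):
    """One trial of a belief-state RL sketch. The agent receives a noisy
    percept of a stimulus that is either +signal or -signal, forms a
    posterior belief over the two alternatives, chooses the more likely
    side, and computes an RPE against a confidence-scaled value
    prediction. (Illustrative code; not the authors' implementation.)"""
    rng = rng or np.random.default_rng()
    percept = true_side * signal + rng.normal(0.0, noise_sd)

    # Posterior P(stimulus = +signal | percept) under a flat prior and
    # Gaussian sensory noise: a logistic function of the percept.
    log_lik_ratio = 2.0 * signal * percept / noise_sd**2
    p_right = 1.0 / (1.0 + np.exp(-log_lik_ratio))

    choice = 1 if p_right >= 0.5 else -1
    confidence = max(p_right, 1.0 - p_right)  # P(choice is correct)

    predicted_value = confidence * reward     # belief-weighted expectation
    outcome = reward if choice == true_side else 0.0
    rpe = outcome - predicted_value           # confidence-dependent RPE
    return choice, confidence, rpe
```

Under this scheme, the value prediction at stimulus time tracks trial-to-trial confidence rather than the stimulus' average value, which is the signature against which the dopamine recordings are tested.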
A well-established notion in cognitive neuroscience proposes that multiple brain systems contribute to choice behaviour. These include: (1) a model-free system that uses values cached from the outcome history of alternative actions, and (2) a model-based system that considers action outcomes and the transition structure of the environment. The widespread use of this distinction, across a range of applications, makes it important to index their distinct influences with high reliability. Here we consider the two-stage task, widely regarded as a gold-standard measure of the contribution of model-based and model-free systems to human choice. We tested the internal and temporal stability of measures from this task, including those estimated via an established computational model, as well as an extended model using drift-diffusion. Drift-diffusion modelling suggested that both choices in the first stage and reaction times (RTs) in the second stage are directly affected by a model-based/model-free trade-off parameter. Both parameter recovery and the stability of model-based estimates were poor, but improved substantially when both choice and RT were used (compared to choice only) and when more trials (than conventionally used in research practice) were included in the analysis. The findings have implications for the interpretation of past and future studies based on the two-stage task, as well as for characterising the contribution of model-based processes to choice behaviour.
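For readers unfamiliar with the established computational model the abstract refers to, the sketch below (our own illustrative Python, not the study's code; the weighting parameter w and all names are assumptions, and the drift-diffusion extension is omitted) shows the standard hybrid valuation in which a single trade-off parameter mixes model-based and model-free first-stage values.

```python
import numpy as np

def hybrid_first_stage_values(q_mf, q_stage2, transitions, w):
    """Hybrid valuation for the two-stage task. q_mf: cached model-free
    values of the two first-stage actions; q_stage2: value of the best
    action in each second-stage state; transitions[a, s]: P(state s |
    action a); w: model-based/model-free trade-off weight in [0, 1]."""
    q_mb = transitions @ q_stage2        # model-based expected values
    return w * q_mb + (1.0 - w) * q_mf   # weighted mixture

def softmax_choice(q_net, beta, rng=None):
    """Softmax choice over net values with inverse temperature beta."""
    rng = rng or np.random.default_rng()
    p = np.exp(beta * (q_net - np.max(q_net)))
    p /= p.sum()
    return rng.choice(len(q_net), p=p)

# Minimal usage: two first-stage actions, each leading more often to a
# different second-stage state (the task's common/rare transitions).
transitions = np.array([[0.7, 0.3],
                        [0.3, 0.7]])
q_net = hybrid_first_stage_values(q_mf=np.array([0.4, 0.5]),
                                  q_stage2=np.array([0.6, 0.3]),
                                  transitions=transitions, w=0.6)
choice = softmax_choice(q_net, beta=3.0)
```

The reliability question the abstract raises amounts to asking how stably w can be recovered from finite choice (and RT) data generated by a process like this one.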
Behavioral and neural evidence reveal a prospective goal-directed decision process that relies on mental simulation of the environment, and a retrospective habitual process that caches returns previously garnered from available choices. Artificial systems combine the two by simulating the environment up to some depth and then exploiting habitual values as proxies for consequences that may arise in the further future. Using a three-step task, we provide evidence that human subjects use such a normative plan-until-habit strategy, implying a spectrum of approaches that interpolates between habitual and goal-directed responding. We found that increasing time pressure led to shallower goal-directed planning, suggesting that a speed-accuracy tradeoff controls the depth of planning, with deeper search leading to more accurate evaluation at the cost of slower decision-making. We conclude that subjects integrate habit-based cached values directly into goal-directed evaluations in a normative manner.

Keywords: planning | habit | reinforcement learning | speed/accuracy tradeoff | tree-based evaluation

Behavioral and neural evidence suggest that the brain uses distinct goal-directed and habitual systems for decision-making (1-5). A goal-directed system exploits an individual's model, i.e., their knowledge of environmental dynamics, to simulate the consequences that will likely follow a choice (6) (Fig. 1A). Such evaluations, which assess a decision tree expanding into the future to estimate the total reward, adapt flexibly to changes in environmental dynamics or the values of outcomes. Evaluating deep trees, however, is computationally expensive (in terms of time, working memory, metabolic energy, etc.) and potentially error-prone. By contrast, the habitual system simply caches the rewards received on previous trials conditional on the choice (Fig. 1C), without a representational characterization of the environment (hence being called "model-free") (6, 7). This process hinders adaptation to changes in the environment, but has the advantage of computational simplicity. Previous studies show distinct behavioral and neurobiological signatures of both systems (8-18). Furthermore, consistent with the theoretical strengths and weaknesses of each system (2, 19), different experimental conditions influence the relative contributions of the two systems in controlling behavior according to their respective competencies (20-23).

Here, we suggest that individuals, rather than simply showing greater reliance on the more competent system in each condition, combine the relative strengths of the two systems in a normative manner by integrating habit-based cached values directly into goal-directed evaluations. Specifically, we propose that, given available resources (time, working memory, etc.), individuals decide the depth k up to which they can afford full forward simulations and use cached habitual values thereafter. That is, individuals compute the value of a choice by adding the first k rewards, predicted by the explicit simulation, to the value...
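The valuation scheme described above lends itself to a compact recursive sketch (our own illustrative Python under the stated assumptions; names such as plan_until_habit_value, model, and cached_q are ours, not the authors'): the decision tree is expanded by explicit model-based simulation for k steps, and cached habitual Q-values stand in for everything beyond that horizon.

```python
def plan_until_habit_value(state, action, depth_k, model, cached_q,
                           gamma=1.0):
    """Value of (state, action) under a plan-until-habit scheme: expand
    the decision tree by explicit model-based simulation for depth_k
    steps, then substitute cached habitual Q-values at the leaves.

    model(state, action) -> list of (prob, reward, next_state) triples
    cached_q[state][action] -> model-free value used beyond the horizon
    """
    if depth_k == 0:
        return cached_q[state][action]    # habitual proxy at the leaf
    value = 0.0
    for prob, reward, next_state in model(state, action):
        best_next = max(
            plan_until_habit_value(next_state, a, depth_k - 1,
                                   model, cached_q, gamma)
            for a in cached_q[next_state]
        )
        value += prob * (reward + gamma * best_next)
    return value

# Toy usage: the (stale) cache prefers R at s0, but depth-2 planning
# discovers that L leads to the currently rewarded branch.
cached_q = {"s0": {"L": 0.0, "R": 1.0}, "sL": {"go": 0.5}, "sR": {"go": 0.2}}
model = lambda s, a: {("s0", "L"): [(1.0, 0.0, "sL")],
                      ("s0", "R"): [(1.0, 0.0, "sR")],
                      ("sL", "go"): [(1.0, 1.0, "sL")],
                      ("sR", "go"): [(1.0, 0.0, "sR")]}[(s, a)]
print(plan_until_habit_value("s0", "L", 2, model, cached_q))  # 1.5
print(plan_until_habit_value("s0", "R", 2, model, cached_q))  # 0.2
```

In the toy usage, deeper search corrects the stale cached preference, illustrating why larger k buys accuracy at the computational cost the abstract describes.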