Reinforcement learning has been widely used to explain animal behavior. In reinforcement learning, an agent learns the values of the states in a task, which collectively constitute the task state space, and uses this knowledge to choose actions and obtain desired outcomes. It has been proposed that the orbitofrontal cortex (OFC) encodes the task state space during reinforcement learning, but it is not well understood how the OFC acquires and stores task state information. Here, we propose a neural network model based on reservoir computing. Reservoir networks exhibit heterogeneous and dynamic activity patterns that are well suited to encoding task states, and this information can be extracted by a linear readout trained with reinforcement learning. We demonstrate how the network acquires and stores task structure. The network exhibits reinforcement learning behavior, aspects of which resemble experimental findings in the OFC. Our study provides a theoretical account of how the OFC may contribute to reinforcement learning and a new approach to understanding the neural mechanisms that underlie it.
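To make the architecture concrete, below is a minimal sketch of a reservoir network with a fixed random recurrent pool and a linear readout trained by a reward-driven delta rule. The toy cue-matching task, network sizes, and hyperparameters (`alpha`, `leak`, `epsilon`) are illustrative assumptions, not the paper's actual model or parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed random reservoir (echo-state style): only the readout is trained.
N, n_in, n_actions = 200, 2, 2
W = rng.normal(0.0, 1.0 / np.sqrt(N), (N, N)) * 1.2  # recurrent weights, illustrative scaling
W_in = rng.normal(0.0, 1.0, (N, n_in))               # input weights
W_out = np.zeros((n_actions, N))                     # trained linear readout

alpha, leak, epsilon = 0.05, 0.3, 0.1                # illustrative hyperparameters

def reservoir_step(x, u):
    # Leaky tanh dynamics; the heterogeneous transients encode recent inputs (task state).
    return (1.0 - leak) * x + leak * np.tanh(W @ x + W_in @ u)

for trial in range(2000):
    cue = int(rng.integers(n_in))       # toy task: cue identity determines the rewarded action
    u = np.eye(n_in)[cue]
    x = np.zeros(N)
    for _ in range(10):                 # let the cue reverberate through the reservoir
        x = reservoir_step(x, u)
    q = W_out @ x                       # action values read out linearly from the state
    a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(q))
    r = 1.0 if a == cue else 0.0        # reward when the action matches the cue
    W_out[a] += alpha * (r - q[a]) * x  # prediction-error update of the readout only
```

Because only `W_out` is updated, the reservoir's rich transient dynamics act as a fixed, high-dimensional representation of task state from which values can be read out linearly.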
A key step in decision making is determining the value associated with each option. This evaluation often depends on accumulating evidence from multiple sources, which may arrive at different times, yet how the brain accumulates evidence for value computation during decision making has not been well studied. To address this problem, we trained rhesus monkeys to perform a decision-making task in which they made eye movement choices between two targets whose reward probabilities had to be determined from the combined evidence of four sequentially presented visual stimuli. We studied how the reward probabilities associated with the stimuli and with the eye movements were encoded in the orbitofrontal cortex (OFC) and the dorsolateral prefrontal cortex (DLPFC) during the decision process. We found that OFC neurons encoded the reward probability associated with individual pieces of evidence in the stimulus domain. Importantly, this representation was transient, and the OFC did not encode the reward probability associated with the combined evidence from multiple stimuli. The computation of the combined reward probabilities was observed only in the DLPFC, and only in the action domain. Furthermore, reward probability encoding in the DLPFC exhibited an asymmetric pattern of mixed selectivity that supported the stimulus-to-action transition of reward information. Our results reveal that the OFC and the DLPFC play distinct roles in value computation during evidence accumulation.
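As an illustration of what combining sequential evidence entails, the sketch below pools hypothetical weights of evidence (log likelihood ratios) across four stimuli and converts the total log odds into a reward probability. The stimulus labels and weight values are invented for illustration; the abstract does not specify the actual stimuli or their assigned probabilities.

```python
import math

# Hypothetical weights of evidence (log likelihood ratios) for each stimulus
# type; labels and values are illustrative, not the task's actual assignments.
woe = {"A": 0.9, "B": 0.4, "C": -0.4, "D": -0.9}

def combined_reward_probability(stimuli):
    """Pool independent evidence by summing log likelihood ratios, then map
    the total log odds to P(reward | choose target 1) with a logistic transform."""
    log_odds = sum(woe[s] for s in stimuli)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Four sequentially presented stimuli that, on balance, favor target 1.
print(combined_reward_probability(["A", "B", "C", "A"]))  # total log odds 1.8, P ~ 0.86
```

Summing log likelihood ratios is the normative way to combine independent pieces of evidence, which is why the per-stimulus encoding in the OFC and the combined encoding in the DLPFC can be described within the same probabilistic framework.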