Synonyms

Reward-based learning; Trial-and-error learning; Temporal-difference (TD) learning; Policy gradient methods
Definition

Reinforcement learning represents a basic paradigm of learning in artificial intelligence and biology. The paradigm considers an agent (robot, human, or animal) that acts in a typically stochastic environment and receives rewards upon reaching certain states. The agent's goal is to maximize the expected reward by choosing the optimal action in any given state. In a cortical implementation, the states are defined by sensory stimuli that feed into a neuronal network, and once the network activity has settled, an action is read out. Learning consists of adapting the synaptic connection strengths into and within the neuronal network based on (typically binary) feedback about the appropriateness of the chosen action. Policy gradient and temporal-difference learning are two methods for deriving synaptic plasticity rules that maximize the expected reward in response to the stimuli.
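This stimulus-action-reward loop can be made concrete in a few lines of code. The following is a minimal sketch, not taken from the source: the network size, the softmax action readout, the learning rate, and the particular zero-mean plasticity induction are all illustrative assumptions (the weight update anticipates the policy gradient rule discussed in the Detailed Description below).

```python
import numpy as np

rng = np.random.default_rng(0)
n_stimuli, n_inputs, n_actions = 4, 10, 2
stimuli = rng.normal(size=(n_stimuli, n_inputs))    # sensory stimulus per state
rewarded = rng.integers(n_actions, size=n_stimuli)  # action rewarded in each state
W = np.zeros((n_actions, n_inputs))                 # synaptic strengths
eta = 0.1                                           # learning rate (assumed)

for trial in range(5000):
    s = rng.integers(n_stimuli)                     # environment presents a state
    x = stimuli[s]                                  # its sensory stimulus
    u = W @ x                                       # settled network activity
    p = np.exp(u - u.max())                         # stochastic (softmax) action readout
    p /= p.sum()
    a = rng.choice(n_actions, p=p)                  # chosen action
    R = 1.0 if a == rewarded[s] else -1.0           # binary reward feedback
    PI = np.outer(np.eye(n_actions)[a] - p, x)      # plasticity induction (zero-mean)
    W += eta * R * PI                               # reward-modulated weight change

# After learning, the network should prefer the rewarded action in each state.
print((W @ stimuli.T).argmax(axis=0) == rewarded)
```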
Detailed Description

Different methods are considered for adapting the synaptic weights w in order to maximize the expected reward ⟨R⟩. In general, the weight adaptation has the form

Δw = R · PI    (1)

where R = ±1 encodes the reward received upon the chosen action, and PI represents the plasticity induction the synapse was calculating based on the pre- and postsynaptic activity. To prevent a systematic drift of the synaptic weights that is not caused by the covariation of reward and plasticity induction, either the average reward or the average plasticity induction must vanish, ⟨R⟩ = 0 or ⟨PI⟩ = 0. Reinforcement learning can be divided into these two, not mutually exclusive, classes, assuming that (A) ⟨PI⟩ = 0 or (B) ⟨R⟩ = 0. The first class encompasses policy gradient methods, while the other, wider class encompasses temporal-difference (TD) methods. Policy gradient methods assume less structure, as they postulate the required property (⟨PI⟩ = 0) at the same synapses of the action selection module that are adapted by the plasticity. TD methods also involve the adaptation of an internal critic, since they have to ensure that the required property on the modulation signal (⟨R⟩ = 0) holds for each stimulus class separately.

A) Policy gradient methods

In the simplest biologically plausible form, actions are represented by the activity of a population of neurons. Each neuron in the population is synaptically driven by feedforward input encoding the current stimulus (Fig. 1). The synaptic strengths are adapted according to the reward-modulated rule (1), with a plasticity induction PI that vanishes on average, as illustrated in the sketches below.
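The defining property of class (A) is that the plasticity induction vanishes on average at the very synapses being adapted, so that reward-uncorrelated activity causes no systematic drift. A quick numerical check of ⟨PI⟩ = 0 for the softmax-population induction PI = (1ₐ − p) ⊗ x used in the sketch above; all names and sizes are again illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_inputs, n_actions, n_samples = 10, 2, 100_000
x = rng.normal(size=n_inputs)                   # a fixed stimulus
W = rng.normal(size=(n_actions, n_inputs))      # arbitrary current weights
u = W @ x
p = np.exp(u - u.max())                         # action probabilities under the policy
p /= p.sum()

a = rng.choice(n_actions, size=n_samples, p=p)  # sample actions from the policy
one_hot = np.eye(n_actions)[a]                  # (n_samples, n_actions)
PI_mean = np.outer((one_hot - p).mean(axis=0), x)

# Since E[one_hot(a)] = p under the policy, the mean induction vanishes.
print(np.abs(PI_mean).max())                    # ~0 up to sampling noise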
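Class (B) can be sketched in the same setting. Here the modulation signal is made zero-mean for each stimulus by subtracting an internally adapted critic; in this single-step setting the TD error reduces to δ = R − V(s). The actor's plasticity induction may then carry a nonzero mean, e.g., a plain Hebbian term. The critic table V, the learning rates, and the Hebbian induction are illustrative assumptions, not the source's specific rule.

```python
import numpy as np

rng = np.random.default_rng(2)
n_stimuli, n_inputs, n_actions = 4, 10, 2
stimuli = rng.normal(size=(n_stimuli, n_inputs))
rewarded = rng.integers(n_actions, size=n_stimuli)
W = np.zeros((n_actions, n_inputs))             # actor synapses
V = np.zeros(n_stimuli)                         # critic: expected reward per stimulus
eta, beta = 0.05, 0.1                           # actor / critic learning rates (assumed)

for trial in range(5000):
    s = rng.integers(n_stimuli)
    x = stimuli[s]
    u = W @ x
    p = np.exp(u - u.max())
    p /= p.sum()
    a = rng.choice(n_actions, p=p)
    R = 1.0 if a == rewarded[s] else -1.0
    delta = R - V[s]                            # modulator, zero-mean per stimulus
    V[s] += beta * delta                        # adapt the internal critic
    PI = np.outer(np.eye(n_actions)[a], x)      # Hebbian induction (not zero-mean)
    W += eta * delta * PI                       # critic-modulated weight change

# The critic approaches the expected reward per stimulus, and the actor
# should come to prefer the rewarded action in each state.
print((W @ stimuli.T).argmax(axis=0) == rewarded)
```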