Synonyms

Reward-based learning; Trial-and-error learning; Temporal-difference (TD) learning; Policy gradient methods
Definition

Reinforcement learning represents a basic paradigm of learning in artificial intelligence and biology. The paradigm considers an agent (robot, human, or animal) that acts in a typically stochastic environment and receives rewards upon reaching certain states. The agent's goal is to maximize the expected reward by choosing the optimal action in any given state. In a cortical implementation, the states are defined by sensory stimuli that feed into a neuronal network, and once the network activity has settled, an action is read out. Learning consists of adapting the synaptic connection strengths into and within the neuronal network based on (typically binary) feedback about the appropriateness of the chosen action. Policy gradient and temporal-difference learning are two methods for deriving synaptic plasticity rules that maximize the expected reward in response to the stimuli.
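This stimulus-action-reward loop can be made concrete in a few lines of code. The following is a minimal sketch, not taken from the source: the network size, the softmax action readout, the learning rate, and the particular zero-mean plasticity induction are all illustrative assumptions (the weight update anticipates the policy gradient rule discussed in the Detailed Description below).

```python
import numpy as np

rng = np.random.default_rng(0)
n_stimuli, n_inputs, n_actions = 4, 10, 2
stimuli = rng.normal(size=(n_stimuli, n_inputs))    # sensory stimulus per state
rewarded = rng.integers(n_actions, size=n_stimuli)  # action rewarded in each state
W = np.zeros((n_actions, n_inputs))                 # synaptic strengths
eta = 0.1                                           # learning rate (assumed)

for trial in range(5000):
    s = rng.integers(n_stimuli)                     # environment presents a state
    x = stimuli[s]                                  # its sensory stimulus
    u = W @ x                                       # settled network activity
    p = np.exp(u - u.max())                         # stochastic (softmax) action readout
    p /= p.sum()
    a = rng.choice(n_actions, p=p)                  # chosen action
    R = 1.0 if a == rewarded[s] else -1.0           # binary reward feedback
    PI = np.outer(np.eye(n_actions)[a] - p, x)      # plasticity induction (zero-mean)
    W += eta * R * PI                               # reward-modulated weight change

# After learning, the network should prefer the rewarded action in each state.
print((W @ stimuli.T).argmax(axis=0) == rewarded)
```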
Detailed Description

Different methods are considered for adapting the synaptic weights w in order to maximize the expected reward ⟨R⟩. In general, the weight adaptation has the form

Δw = R · PI    (1)

where R = ±1 encodes the reward received upon the chosen action, and PI represents the plasticity induction the synapse was calculating based on the pre- and postsynaptic activity. To prevent a systematic drift of the synaptic weights that is not caused by the covariation of reward and plasticity induction, either the average reward or the average plasticity induction must vanish, ⟨R⟩ = 0 or ⟨PI⟩ = 0. Reinforcement learning can be divided into these two, not mutually exclusive, classes, assuming that (A) ⟨PI⟩ = 0 or (B) ⟨R⟩ = 0. The first class encompasses policy gradient methods, while the other, wider class encompasses temporal-difference (TD) methods. Policy gradient methods assume less structure, as they postulate the required property (⟨PI⟩ = 0) at the same synapses of the action selection module that are adapted by the plasticity. TD methods also involve the adaptation of an internal critic, since they have to ensure that the required property on the modulation signal (⟨R⟩ = 0) holds for each stimulus class separately.

A) Policy gradient methods

In the simplest biologically plausible form, actions are represented by the activity of a population of neurons. Each neuron in the population is synaptically driven by feedforward input encoding the current stimulus (Fig. 1). The synaptic strengths are adapted according to the reward-modulated rule (1), with a plasticity induction PI that vanishes on average, as illustrated in the sketches below.
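The defining property of class (A) is that the plasticity induction vanishes on average at the very synapses being adapted, so that reward-uncorrelated activity causes no systematic drift. A quick numerical check of ⟨PI⟩ = 0 for the softmax-population induction PI = (1ₐ − p) ⊗ x used in the sketch above; all names and sizes are again illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_inputs, n_actions, n_samples = 10, 2, 100_000
x = rng.normal(size=n_inputs)                   # a fixed stimulus
W = rng.normal(size=(n_actions, n_inputs))      # arbitrary current weights
u = W @ x
p = np.exp(u - u.max())                         # action probabilities under the policy
p /= p.sum()

a = rng.choice(n_actions, size=n_samples, p=p)  # sample actions from the policy
one_hot = np.eye(n_actions)[a]                  # (n_samples, n_actions)
PI_mean = np.outer((one_hot - p).mean(axis=0), x)

# Since E[one_hot(a)] = p under the policy, the mean induction vanishes.
print(np.abs(PI_mean).max())                    # ~0 up to sampling noise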
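Class (B) can be sketched in the same setting. Here the modulation signal is made zero-mean for each stimulus by subtracting an internally adapted critic; in this single-step setting the TD error reduces to δ = R − V(s). The actor's plasticity induction may then carry a nonzero mean, e.g., a plain Hebbian term. The critic table V, the learning rates, and the Hebbian induction are illustrative assumptions, not the source's specific rule.

```python
import numpy as np

rng = np.random.default_rng(2)
n_stimuli, n_inputs, n_actions = 4, 10, 2
stimuli = rng.normal(size=(n_stimuli, n_inputs))
rewarded = rng.integers(n_actions, size=n_stimuli)
W = np.zeros((n_actions, n_inputs))             # actor synapses
V = np.zeros(n_stimuli)                         # critic: expected reward per stimulus
eta, beta = 0.05, 0.1                           # actor / critic learning rates (assumed)

for trial in range(5000):
    s = rng.integers(n_stimuli)
    x = stimuli[s]
    u = W @ x
    p = np.exp(u - u.max())
    p /= p.sum()
    a = rng.choice(n_actions, p=p)
    R = 1.0 if a == rewarded[s] else -1.0
    delta = R - V[s]                            # modulator, zero-mean per stimulus
    V[s] += beta * delta                        # adapt the internal critic
    PI = np.outer(np.eye(n_actions)[a], x)      # Hebbian induction (not zero-mean)
    W += eta * delta * PI                       # critic-modulated weight change

# The critic approaches the expected reward per stimulus, and the actor
# should come to prefer the rewarded action in each state.
print((W @ stimuli.T).argmax(axis=0) == rewarded)
```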