2008
DOI: 10.1016/j.amc.2007.07.043
Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems

Cited by 68 publications (34 citation statements)
References 13 publications

Citation statements (ordered by relevance):
“…While there is no such currently established framework for social learning research, multi-armed bandits have been widely deployed to study learning across biology, economics, artificial intelligence research and computer science (e.g. 18, 24, 25–28) because they mimic a common problem faced by individuals that must make decisions about how to allocate their time in order to maximize their payoffs. Multi-armed bandits capture the essence of many difficult problems in the real world, for instance, where there are many possible actions, only a few of which yield a high payoff, where it is possible to learn asocially or through observation of others, where copying error occurs and where the environment changes.…”
Section: The Tournament
Citation type: mentioning
confidence: 99%
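For readers unfamiliar with the setting these excerpts describe, here is a minimal sketch of a non-stationary multi-armed bandit: every pull returns a noisy reward, and the arms' payoff means drift over time, so the best arm can change. The class name, the Gaussian random-walk drift model, and all parameter values are illustrative assumptions, not details from the cited paper.

```python
import random

class NonstationaryBandit:
    """A k-armed bandit whose arm means drift by a Gaussian random walk,
    so the best arm can change over time (illustrative sketch only)."""

    def __init__(self, k=10, drift=0.01, seed=0):
        self.rng = random.Random(seed)
        self.k = k
        self.drift = drift
        self.means = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, arm):
        # Reward is noisy feedback around the arm's current mean.
        reward = self.rng.gauss(self.means[arm], 1.0)
        # Every mean then takes a small random step: the source of
        # non-stationarity ("where the environment changes").
        self.means = [m + self.rng.gauss(0.0, self.drift) for m in self.means]
        return reward
```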
“…The restless bandit problem is known to be PSPACE-complete, meaning that optimal solutions are difficult to compute in practice [30, 13]. Multi-armed bandit problems have previously been used to study the tradeoff between exploitation and exploration in learning environments [36, 24].…”
Section: Related Computational Techniques
Citation type: mentioning
confidence: 99%
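The excerpt above mentions the tradeoff between exploitation and exploration; a common baseline for navigating it is the epsilon-greedy rule, sketched below against the hypothetical `NonstationaryBandit` from the previous snippet. Epsilon-greedy is a standard textbook heuristic, not the algorithm proposed by the cited paper.

```python
import random

def epsilon_greedy(bandit, k=10, steps=1000, epsilon=0.1, seed=1):
    """Pull `steps` arms, exploiting the current value estimates most of
    the time and exploring a uniformly random arm with probability epsilon."""
    rng = random.Random(seed)
    q = [0.0] * k   # value estimate per arm
    n = [0] * k     # pull count per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                   # explore
        else:
            arm = max(range(k), key=lambda a: q[a])  # exploit
        reward = bandit.pull(arm)
        n[arm] += 1
        q[arm] += (reward - q[arm]) / n[arm]  # incremental sample average
        total += reward
    return total
```

For example, `epsilon_greedy(NonstationaryBandit())` returns the total reward accumulated over 1,000 pulls.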
“…In its simplest form, reinforcement-learning analyses often use the multi-armed (or "n-armed") bandit task to evaluate various methods of distributing exploration and exploitation (e.g., Dimitrakakis & Lagoudakis, 2008; Sikora, 2008). This task provides an excellent platform to explore choice in stationary (with unchanging payoffs) and nonstationary (with changing payoffs) environments, and it has also been applied to the domains of human learning and cognition (e.g., Burns, Lee, & Vickers, 2006; Plowright & Shettleworth, 1990), economics (e.g., Banks, Olson, & Porter, 1997), marketing and management (e.g., Azoulay-Schwartz, Kraus, & Wilkenfeld, 2004; Valsecchi, 2003), and math and computer science (e.g., Auer, Cesa-Bianchi, Freund, & Schapire, 1995; Koulouriotis & Xanthopoulos, 2008).…”
Citation type: mentioning
confidence: 99%
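To make the stationary/nonstationary distinction in the excerpt concrete, the sketch below contrasts the two textbook value updates described by Sutton & Barto (1998): the sample average, which suits unchanging payoffs, and the constant step size, which keeps tracking changing ones. The function name and calling convention are illustrative assumptions; the variable names follow the epsilon-greedy sketch above.

```python
def update_value(q, arm, reward, n=None, alpha=None):
    """Two textbook value updates (Sutton & Barto, 1998).

    Sample average (pass `n`): appropriate when payoffs are stationary,
    but it weighs old rewards as heavily as new ones, so it adapts
    slowly once payoffs start to drift.
    Constant step size (pass `alpha` in (0, 1]): an exponential
    recency-weighted average that keeps tracking changing payoffs.
    """
    if alpha is not None:
        q[arm] += alpha * (reward - q[arm])    # non-stationary setting
    else:
        n[arm] += 1
        q[arm] += (reward - q[arm]) / n[arm]   # stationary setting
```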
“…We took an approach that was inspired by the study of reinforcement-learning algorithms as applied to machine learning (Koulouriotis & Xanthopoulos, 2008; Sutton & Barto, 1998). In its simplest form, reinforcement-learning analyses often use the multi-armed (or "n-armed") bandit task to evaluate various methods of distributing exploration and exploitation (e.g., Dimitrakakis & Lagoudakis, 2008; Sikora, 2008).…”
Citation type: mentioning
confidence: 99%