Optimal Exploration–Exploitation in a Multi-armed Bandit Problem with Non-stationary Rewards

Besbes, Omar; Gur, Yonatan; Zeevi, Assaf

doi:10.1287/stsy.2019.0033

Cited by 183 publications

(308 citation statements)

References 32 publications

Supporting

Mentioning: 299

Contrasting


“…Switching costs were considered by [60,71,72], and including these generally led to remaining at the same arm for longer, similar to patch foraging. The problem of depleting rewards is related to the non-stationary bandit, in the specific case where arm reward rates are choice dependent [73,74]. In a patch with depleting rewards, the expected return changes in a specific and predictable way, and therefore can be treated mathematically, as we … Optimal strategy and arrival time depend on the fraction of low versus high yield patches and the discriminability between the 1st/2nd highest yield patches (Figs.…”

Section: B Relation To Other Work (mentioning)

confidence: 99%

Normative theory of patch foraging decisions

Kilpatrick

Davidson

Hady

2020

Preprint


Foraging is a fundamental behavior as animals' search for food is crucial for their survival. Patch leaving is a canonical foraging behavior, but classic theoretical conceptions of patch leaving decisions lack some key naturalistic details. Optimal foraging theory provides general rules for when an animal should leave a patch, but does not provide mechanistic insights about how those rules change with the structure of the environment. Such a mechanistic framework would aid in designing quantitative experiments to unravel behavioral and neural underpinnings of foraging. To address these shortcomings, we develop a normative theory of patch foraging decisions. Using a Bayesian approach, we treat patch leaving behavior as a statistical inference problem. We derive the animals' optimal decision strategies in both non-depleting and depleting environments. A majority of these cases can be analyzed explicitly using methods from stochastic processes. Our behavioral predictions are expressed in terms of the optimal patch residence time and the decision rule by which an animal departs a patch. We also extend our theory to a hierarchical model in which the forager learns the environmental food resource distribution. The quantitative framework we develop will therefore help experimenters move from analyzing trial-based behavior to continuous behavior without the loss of quantitative rigor. Our theoretical framework both extends optimal foraging theory and motivates a variety of behavioral and neuroscientific experiments investigating patch foraging behavior.

“…Our work differs from cascading bandits in that we consider learning in a non-stationary environment. Non-stationary bandit problems have been widely studied [Slivkins and Upfal, 2008;Yu and Mannor, 2009;Garivier and Moulines, 2011;Besbes et al, 2014;Liu et al, 2018]. However, previous work requires a small action space.…”

Section: Related Work (mentioning)

confidence: 99%

“…Particularly, we consider abruptly changing environments where user preferences remain constant in certain time periods, named epochs, but change occurs abruptly at unknown moments called breakpoints. The abrupt changes in user preferences give rise to a new challenge of balancing "remembering" and "forgetting" [Besbes et al, 2014]: the more past observations an algorithm retains the higher the risk of making a biased estimator, while the fewer observations retained the higher stochastic error it has on the estimates of the user preferences.…”

Section: Introduction (mentioning)

confidence: 99%
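
The "remembering" versus "forgetting" trade-off described in the statement above can be illustrated with a sliding-window mean estimate. This is a minimal sketch and not the method of any paper listed here; the function name `sliding_window_mean`, the window sizes, and the noise-free reward sequence are all assumptions made for illustration.

```python
from collections import deque


def sliding_window_mean(rewards, window):
    """Return the running mean of the last `window` rewards at each step.

    Hypothetical helper for illustration only. A large window "remembers"
    more (less noise, but stale after a change); a small window "forgets"
    faster (tracks changes, but is noisier on real data).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # the oldest reward is evicted on overflow
    total = 0.0
    means = []
    for r in rewards:
        if len(recent) == window:
            total -= recent[0]  # subtract the value about to be evicted
        recent.append(r)
        total += r
        means.append(total / len(recent))
    return means


# Noise-free example: the mean reward jumps from 0 to 1 at step 10.
rewards = [0.0] * 10 + [1.0] * 10
short_est = sliding_window_mean(rewards, window=2)
long_est = sliding_window_mean(rewards, window=10)
```

Two steps after the change, the short window has already caught up while the long window is still dominated by pre-change observations; with noisy rewards the short window would pay for that speed with a higher-variance estimate.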

Cascading Non-Stationary Bandits: Online Learning to Rank in the Non-Stationary Cascade Model

Rijke

2019

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence


Non-stationarity appears in many online applications such as web search and advertising. In this paper, we study the online learning to rank problem in a non-stationary environment where user preferences change abruptly at an unknown moment in time. We consider the problem of identifying the K most attractive items and propose cascading non-stationary bandits, an online learning variant of the cascading model, where a user browses a ranked list from top to bottom and clicks on the first attractive item. We propose two algorithms for solving this non-stationary problem: CascadeDUCB and CascadeSWUCB. We analyze their performance and derive gap-dependent upper bounds on the n-step regret of these algorithms. We also establish a lower bound on the regret for cascading non-stationary bandits and show that both algorithms match the lower bound up to a logarithmic factor. Finally, we evaluate their performance on a real-world web search click dataset.


“…In the non-stationary case, the rewards $\{R_t(i)\}_{i \in \{1,\dots,N\}}$ are drawn from a set of unknown and non-stationary distributions $p_i(t)$ that may vary over time. These variations can be continuous (Slivkins & Upfal, 2008) or abrupt (Garivier & Moulines, 2011; Besbes, Gur, & Zeevi, 2014; Raj & Kalyani, 2017). In this work, we'll consider only abruptly changing environments: the distribution of rewards remains stationary for a given period and a change occurs at unknown time instants (breakpoints).…”

Section: Non-stationary Bandit Task (mentioning)

confidence: 99%
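
An abruptly changing environment of the kind this statement describes can be sketched as a piecewise-stationary reward process. The code below is only an illustration under stated assumptions: the function names `mean_at` and `draw_rewards`, the Bernoulli reward model, the breakpoint at t = 100, and the arm means are hypothetical and not taken from the cited work.

```python
import bisect
import random


def mean_at(t, breakpoints, epoch_means):
    """Return the list of per-arm means in force at time step t.

    `breakpoints` is a sorted list of change times; `epoch_means` holds one
    list of per-arm means per epoch, so it needs len(breakpoints) + 1
    entries. All values used here are illustrative assumptions.
    """
    if len(epoch_means) != len(breakpoints) + 1:
        raise ValueError("need exactly one mean vector per epoch")
    epoch = bisect.bisect_right(breakpoints, t)  # count of breakpoints <= t
    return epoch_means[epoch]


def draw_rewards(t, breakpoints, epoch_means, rng):
    """Draw one Bernoulli reward per arm at time t."""
    return [1 if rng.random() < p else 0
            for p in mean_at(t, breakpoints, epoch_means)]


# Two arms whose roles swap at the (hypothetical) breakpoint t = 100.
breakpoints = [100]
epoch_means = [[0.8, 0.2], [0.2, 0.8]]
rng = random.Random(0)  # fixed seed so repeated runs draw the same rewards
sample = draw_rewards(50, breakpoints, epoch_means, rng)
```

The learner would observe only the rewards, not `breakpoints`, which is what makes the breakpoints "unknown" in the sense of the quoted passage.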

“…There exist some near-optimal algorithms which aim to achieve this bound in stationary (Agrawal, 1995) and non-stationary bandit tasks (Garivier & Moulines, 2011; Besbes et al, 2014; Raj & Kalyani, 2017). While the policy proposed by these algorithms is optimal even in our problem for maximizing expected reward, they assume knowledge of certain parameters like T (length of context) which is not available to the agent.…”

Section: Non-stationary Bandit Task (mentioning)

confidence: 99%

Modeling the Role of the Striatum in Non-Stationary Bandit Tasks

Shivkumar

Chakravarthy

Rougier

2017

Preprint


Abstract. Decision making in non-stationary and stochastic environments can be interpreted as a variant of the non-stationary multi-armed bandit task where the optimal decision requires identification of the current context. We formalize the problem using a Bayesian approach taking biological constraints into account (limited memory) that allow us to define a sub-optimal theoretical model. From this theoretical model, we derive a biological model of the striatum based on its micro-anatomy that is able to learn state and action representations. We show that this model matches the theoretical model for low stochasticity in the environment and could be considered as a neural implementation of the theoretical model. Both models are tested on a non-stationary multi-armed bandit task and compared to animal performances.

