2019
DOI: 10.1287/stsy.2019.0033

Optimal Exploration–Exploitation in a Multi-armed Bandit Problem with Non-stationary Rewards

Abstract: In a multi-armed bandit problem, a gambler needs to choose at each round one of K arms, each characterized by an unknown reward distribution. The objective is to maximize cumulative expected earnings over a planning horizon of length T, and performance is measured in terms of regret relative to a (static) oracle that knows the identity of the best arm a priori. This problem has been studied extensively when the reward distributions do not change over time, and uncertainty essentially amounts to identifying the…
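
To make the performance measure concrete, here is a minimal Python sketch (an illustration only, not the paper's algorithm): it simulates a K-armed Bernoulli bandit with assumed arm means and measures the regret of a naive uniformly random policy against a static oracle that always plays the arm with the highest mean.

    import numpy as np

    def regret_vs_static_oracle(means, T, seed=None):
        # means: assumed true expected rewards of the K arms (Bernoulli rewards here)
        # T: planning horizon
        rng = np.random.default_rng(seed)
        K = len(means)
        best_mean = max(means)          # the static oracle knows the best arm a priori
        earned = 0.0
        for _ in range(T):
            arm = rng.integers(K)       # placeholder policy: choose an arm uniformly at random
            earned += rng.random() < means[arm]
        # regret = oracle's expected cumulative earnings minus the policy's realized earnings
        return best_mean * T - earned

    print(regret_vs_static_oracle([0.3, 0.5, 0.7], T=10_000))

Any actual exploration-exploitation policy would replace the uniform choice; the regret definition itself is unchanged.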

Cited by 183 publications (308 citation statements). References 32 publications.
“…Switching costs were considered by [60,71,72], and including these generally led to remaining at the same arm for longer, similar to patch foraging. The problem of depleting rewards is related to the non-stationary bandit, in the specific case where arm reward rates are choice dependent [73,74]. In a patch with depleting rewards, the expected return changes in a specific and predictable way, and can therefore be treated mathematically…”
Section: B Relation To Other Work (mentioning)
confidence: 99%
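
As a rough illustration of the choice-dependent non-stationarity mentioned in the excerpt above (my own sketch, not either paper's model), an arm's reward rate can be made to deplete only when that arm is pulled, mimicking a foraged patch; the decay factor below is an arbitrary assumption.

    import numpy as np

    class DepletingBandit:
        # Bernoulli bandit whose arm means decay only when the arm is chosen,
        # a simple stand-in for a patch whose yield depletes as it is foraged.
        def __init__(self, means, decay=0.9, seed=None):
            self.means = np.array(means, dtype=float)
            self.decay = decay                      # assumed multiplicative depletion per pull
            self.rng = np.random.default_rng(seed)

        def pull(self, arm):
            reward = float(self.rng.random() < self.means[arm])
            self.means[arm] *= self.decay           # only the chosen arm's rate changes
            return reward

    env = DepletingBandit([0.8, 0.4, 0.2], seed=0)
    print([env.pull(0) for _ in range(5)], env.means)
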
“…Our work differs from cascading bandits in that we consider learning in a non-stationary environment. Non-stationary bandit problems have been widely studied [Slivkins and Upfal, 2008; Yu and Mannor, 2009; Garivier and Moulines, 2011; Besbes et al., 2014; Liu et al., 2018]. However, previous work requires a small action space.…”
Section: Related Work (mentioning)
confidence: 99%
“…Particularly, we consider abruptly changing environments where user preferences remain constant in certain time periods, named epochs, but change abruptly at unknown moments called breakpoints. The abrupt changes in user preferences give rise to a new challenge of balancing "remembering" and "forgetting" [Besbes et al., 2014]: the more past observations an algorithm retains, the higher the risk of a biased estimator; the fewer observations it retains, the higher the stochastic error in its estimates of the user preferences.…”
Section: Introduction (mentioning)
confidence: 99%
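
A common way to balance that "remembering" versus "forgetting" trade-off is a sliding-window estimate in the spirit of sliding-window UCB (Garivier and Moulines, 2011); the sketch below only shows how the window length w (an assumed tuning parameter) controls the bias/variance balance and does not reproduce any cited algorithm.

    from collections import deque
    import numpy as np

    class SlidingWindowMean:
        # Mean of the last w rewards of one arm: a large w "remembers" more
        # (lower variance, but biased after a breakpoint); a small w "forgets"
        # faster (less bias after a change, but noisier estimates).
        def __init__(self, w):
            self.window = deque(maxlen=w)

        def update(self, reward):
            self.window.append(reward)

        def estimate(self):
            return float(np.mean(self.window)) if self.window else 0.0

    est = SlidingWindowMean(w=50)
    for r in [1, 0, 1, 1]:
        est.update(r)
    print(est.estimate())   # 0.75
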
“…In the non-stationary case, the rewards {R_t(i)}_{i ∈ {1,...,N}} are drawn from unknown and non-stationary distributions p_i(t) that may vary over time. These variations can be continuous (Slivkins & Upfal, 2008) or abrupt (Garivier & Moulines, 2011; Besbes, Gur, & Zeevi, 2014; Raj & Kalyani, 2017). In this work, we'll consider only abruptly changing environments: the distribution of rewards remains stationary for a given period and changes occur at unknown time instants (breakpoints).…”
Section: Non-stationary Bandit Task (mentioning)
confidence: 99%
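
The abruptly changing (piecewise-stationary) environment described above can be simulated directly; in the sketch below the breakpoint times and per-epoch arm means are arbitrary assumed values, and the breakpoints are known only to the simulator, not to the learning agent.

    import numpy as np

    def piecewise_stationary_rewards(T, breakpoints, epoch_means, seed=None):
        # breakpoints: sorted round indices at which the reward distributions change
        # epoch_means: one list of arm means per epoch (len(breakpoints) + 1 entries)
        rng = np.random.default_rng(seed)
        K = len(epoch_means[0])
        rewards = np.empty((T, K))
        epoch = 0
        for t in range(T):
            if epoch < len(breakpoints) and t >= breakpoints[epoch]:
                epoch += 1                 # abrupt change: switch to the next epoch's means
            rewards[t] = rng.random(K) < epoch_means[epoch]
        return rewards

    R = piecewise_stationary_rewards(T=300, breakpoints=[100, 200],
                                     epoch_means=[[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]], seed=1)
    print(R.mean(axis=0))
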
“…There exist some near-optimal algorithms which aim to achieve this bound in stationary (Agrawal, 1995) and non-stationary bandit tasks (Garivier & Moulines, 2011; Besbes et al., 2014; Raj & Kalyani, 2017). While the policies proposed by these algorithms are optimal even in our problem for maximizing expected reward, they assume knowledge of certain parameters, such as T (the length of the context), which is not available to the agent.…”
Section: Non-stationary Bandit Task (mentioning)
confidence: 99%