2017
DOI: 10.48550/arxiv.1707.00205
Preprint

An Asymptotically Optimal Index Policy for Finite-Horizon Restless Bandits

Abstract: We consider a restless multi-armed bandit (RMAB) problem with a finite horizon and multiple pulls per period. Leveraging the Lagrangian relaxation, we approximate the problem with a collection of single-arm problems. We then propose an index-based policy that uses the optimal solutions of the single-arm problems to index individual arms, and we prove that it is asymptotically optimal as the number of arms tends to infinity. We also use simulation to show that this index-based policy performs better than the state-of-the-art…
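To make the approach concrete, here is a minimal Python sketch of the index-policy idea the abstract describes: solve a Lagrangian-relaxed single-arm dynamic program for each arm, index each arm by how much pulling it beats staying passive, and pull the m highest-indexed arms each period. The advantage-based index rule and all names below are illustrative assumptions, not the paper's exact construction.

import numpy as np

def single_arm_q_values(P_active, P_passive, r_active, r_passive, T, lam):
    # Backward induction for one arm, charging a Lagrange multiplier `lam`
    # per pull. P_* are (S, S) transition matrices, r_* are (S,) rewards.
    # Returns Q[t, s, a] with a = 0 (passive) and a = 1 (active).
    S = r_active.shape[0]
    V = np.zeros(S)                                  # terminal value-to-go
    Q = np.zeros((T, S, 2))
    for t in range(T - 1, -1, -1):
        Q[t, :, 0] = r_passive + P_passive @ V       # stay passive
        Q[t, :, 1] = r_active - lam + P_active @ V   # pull and pay lam
        V = Q[t].max(axis=1)
    return Q

def choose_arms(states, t, Q_per_arm, m):
    # Pull the m arms whose active-vs-passive advantage at time t is largest.
    adv = np.array([Q[t, s, 1] - Q[t, s, 0] for Q, s in zip(Q_per_arm, states)])
    return np.argsort(-adv)[:m]

In the full relaxation, lam would be chosen so that the single-arm policies pull m arms per period in expectation; here it is taken as given, and ties are broken arbitrarily by the sort.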

Cited by 7 publications (25 citation statements). References 12 publications.
“…This work and its follow-ups such as (Weber and Weiss, 1990) focus on heuristics that are optimal under an asymptotic scaling where the number of pulls per period scales linearly with the total number of arms. More recently, a finite-horizon variant of the restless bandits problem was studied in (Hu and Frazier, 2017) under a similar scaling. For a survey of variations of the restless bandits problem, see (Gittins et al., 2011; Zayas-Caban et al., 2019; Brown and Smith, 2020).…”
Section: Related Literature (mentioning)
confidence: 99%
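For concreteness, the asymptotic scaling this statement refers to fixes the activation fraction while the number of arms grows. In LaTeX (the symbol \alpha for that fixed fraction is our notation):

m_N = \lceil \alpha N \rceil, \qquad \alpha \in (0, 1) \ \text{fixed}, \qquad N \to \infty,

so the per-period pull budget m_N scales linearly with the number of arms N.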
“…This is in contrast to most of the existing Whittle index-based policies, which are only well defined when the system is indexable, a condition that is hard to verify and may not hold in general. A line of work [18,19,40] has focused on designing index policies without the indexability requirement, and closest to our work is the parallel work on restless bandits [40] with known transition probabilities and reward functions. In particular, [40] explores index policies that are similar to ours, but under the assumption that the individual MDPs of the arms are homogeneous.…”
Section: The Occupancy-measured-reward Index Policy (mentioning)
confidence: 99%
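For reference, the indexability condition this statement refers to is Whittle's: as the subsidy \lambda paid for passivity increases, the set of states in which passivity is optimal must grow monotonically. In LaTeX (our notation):

\Pi(\lambda) = \{\, s : \text{passive is optimal in state } s \text{ under subsidy } \lambda \,\}, \qquad \lambda \le \lambda' \;\Rightarrow\; \Pi(\lambda) \subseteq \Pi(\lambda').

The Whittle index of a state is then the smallest \lambda that places it in \Pi(\lambda); it is this monotone-inclusion property that is hard to verify and may fail.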
“…Inspired by Whittle's work, many studies have focused on finding index policies for restless bandit problems, e.g., [17,18,19,20,21]. This line of work assumes that the system parameters are known to the decision-maker.…”
Section: Introduction (mentioning)
confidence: 99%
“…The proof techniques used by Brown and Smith (2020) and Zayas-Caban et al. (2019), however, rely heavily on the Central Limit Theorem (CLT), and do not offer a path toward showing a bound tighter than O(√N). Our work fills these two gaps: we propose a broad class of policies, called fluid-priority policies, which generalize the essential characteristics of the policies proposed by Brown and Smith (2020) and Hu and Frazier (2017). Addressing the inconsistency between simulation studies and past…”
Section: Introduction (mentioning)
confidence: 99%
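To unpack the bound discussed here (notation ours): if V^{\mathrm{opt}}_N and V^{\pi}_N denote the optimal and policy values with N arms, an O(√N) gap in total reward means the per-arm gap vanishes as N grows:

V^{\mathrm{opt}}_N - V^{\pi}_N \le C\sqrt{N} \quad \Longrightarrow \quad \frac{V^{\mathrm{opt}}_N - V^{\pi}_N}{N} = O\bigl(N^{-1/2}\bigr),

which is why tightening the bound below O(√N) would give a qualitatively stronger per-arm guarantee.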
“…As a result, there has been substantial interest (e.g., Whittle 1980, Weber and Weiss 1990, Zayas-Caban et al. 2019, Hu and Frazier 2017, Brown and Smith 2020) in developing approximate policies whose performance is provably close to optimal but that require computation that does not grow with N. However, despite substantial interest and effort focused on this regime, current understanding is limited in several important ways.…”
Section: Introduction (mentioning)
confidence: 99%