Learn to Intervene: An Adaptive Learning Policy for Restless Bandits in Application to Preventive Healthcare

Biswas, Arpita; Aggarwal, Gaurav; Varakantham, Pradeep; Tambe, Milind

doi:10.24963/ijcai.2021/556

Cited by 15 publications

(19 citation statements)

References 6 publications

“…We depart from [10] by studying the discounted counterpart as motivated by [42] since the difference in the optimal value between the discounted and average settings is small as long as α is close to 1 [40], [41]. Recently, another line of work [43] leveraged Q-learning to approximate Whittle indices through a single-timescale SA where Q-function and Whittle indices were learned independently. [43] considered the finite-horizon MDP and cannot be directly applied to infinite-horizon discounted or average reward MDPs.…”

Section: B Q-whittle Learning (mentioning)

confidence: 99%

“…Recently, another line of work [43] leveraged Q-learning to approximate Whittle indices through a single-timescale SA where Q-function and Whittle indices were learned independently. [43] considered the finite-horizon MDP and cannot be directly applied to infinite-horizon discounted or average reward MDPs. Finally, we are the first to provide a finite-time analysis of Whittle index based Q-learning, which further differentiates our work.…”

Section: B Q-whittle Learning (mentioning)

confidence: 99%

“…Q-learning based Algorithms. We compare our Q-Whittle learning to existing Q-learning algorithms (see Remark 3) when system parameters are unknown: (a) Q-learning Whittle Index Controller (Fu) [11]; (b) Q-learning for Whittle index (AB) [10]; (c) Whittle Index Q-learning (WIQL) [43]; and (d) our Whittle policy, i.e., assume full knowledge of underlying transition probabilities. The discount factor is α = 0.8, learning rates are initialized to γ(0) = 0.1 and η(0) = 0.01, and are decayed by half every 1,000 time steps.…”

Section: A Baselines (mentioning)

confidence: 99%
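
The learning-rate schedule quoted in the statement above can be sketched as follows. This is a minimal illustration, not the citing authors' code: it assumes "decayed by half every 1,000 time steps" means a piecewise-constant rate that halves at steps 1,000, 2,000, and so on, and the function name `step_decay` and its parameters are hypothetical.

```python
def step_decay(initial_rate: float, step: int,
               period: int = 1000, factor: float = 0.5) -> float:
    """Return the rate at time `step` when it is multiplied by `factor`
    once every `period` steps (assumed reading of the quoted schedule)."""
    if step < 0:
        raise ValueError("step must be non-negative")
    # Integer division counts how many full periods have elapsed.
    return initial_rate * factor ** (step // period)


# Initial values taken from the quoted statement: gamma(0) = 0.1, eta(0) = 0.01.
GAMMA_0 = 0.1
ETA_0 = 0.01

if __name__ == "__main__":
    for t in (0, 999, 1000, 2500):
        print(t, step_decay(GAMMA_0, t), step_decay(ETA_0, t))
```

Under this reading both rates stay at their initial values for the first 1,000 steps and then halve together, so their ratio remains fixed throughout training.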

Whittle Index based Q-Learning for Wireless Edge Caching with Linear Function Approximation

Xiong, Wang, Li et al. 2022

Preprint

An explosive growth in the number of on-demand content requests has imposed significant pressure on current wireless network infrastructure. To enhance the perceived user experience, and support latency-sensitive applications, edge computing has emerged as a promising computing paradigm. The performance of a wireless edge depends on contents that are cached. In this paper, we consider the problem of content caching at the wireless edge with unreliable channels to minimize average content request latency. We formulate this problem as a restless bandit problem, which is provably hard to solve. We begin by investigating a discounted counterpart, and prove that it admits an optimal policy of the threshold-type. We then show that the result also holds for the average latency problem. Using these structural results, we establish the indexability of the problem, and employ Whittle index policy to minimize average latency. Since system parameters such as content request rate are often unknown, we further develop a model-free reinforcement learning algorithm dubbed Q-Whittle learning that relies on our index policy. We also derive a bound on its finite-time convergence rate. Simulation results using real traces demonstrate that our proposed algorithms yield excellent empirical performance.

“…However, this is suboptimal for general RMABs since rewards are state- and action-dependent. Addressing this, Biswas et al [5] give a Q-learning-based algorithm that acts on the arms that have the largest difference between their active and passive Q values. Fu et al [8] take a related approach that adjusts the Q values by some 𝜆, and use it to estimate the Whittle index.…”

Section: Related Work (mentioning)

confidence: 99%
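
The selection rule described in the statement above (act on the arms with the largest gap between active and passive Q values) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of Biswas et al.: the table layout `q[arm][state][action]`, the function names, and the budget parameter are hypothetical, and the update shown is the standard tabular discounted Q-learning rule rather than the exact rule used in the cited paper.

```python
from typing import List, Sequence

# Assumed layout: q[arm][state][action], with action 0 = passive, 1 = active.
QTable = List[List[List[float]]]


def select_arms(q: QTable, states: Sequence[int], budget: int) -> List[int]:
    """Pick the `budget` arms whose current state has the largest
    Q(s, active) - Q(s, passive) gap."""
    gaps = [q[arm][s][1] - q[arm][s][0] for arm, s in enumerate(states)]
    # Stable sort: ties keep the lower arm index first.
    ranked = sorted(range(len(states)), key=lambda arm: gaps[arm], reverse=True)
    return ranked[:budget]


def q_update(q: QTable, arm: int, state: int, action: int, reward: float,
             next_state: int, lr: float, discount: float) -> None:
    """Standard tabular Q-learning update for one arm (an assumption here)."""
    target = reward + discount * max(q[arm][next_state])
    q[arm][state][action] += lr * (target - q[arm][state][action])


if __name__ == "__main__":
    # Three arms, two states each, all Q values initially zero.
    q = [[[0.0, 0.0] for _ in range(2)] for _ in range(3)]
    q[0][0][1] = 1.0   # arm 0, state 0: acting looks valuable
    q[1][1][1] = 0.5   # arm 1, state 1: acting looks mildly valuable
    q[2][0][0] = 2.0   # arm 2, state 0: staying passive looks better
    print(select_arms(q, states=[0, 1, 0], budget=2))
```

Ranking by the Q-value gap needs only one Q table per arm, which is the contrast the citing authors draw with approaches that additionally adjust the Q values by a subsidy 𝜆 to estimate the Whittle index.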

“…To address this shortcoming in previous work, this paper presents the first algorithms for the online setting for multi-action RMABs. Indeed, the online setting for even binary-action RMABs has received only limited attention, in the works of Fu et al [8], Avrachenkov and Borkar [3], and Biswas et al [5,6]. These papers adopt variants of the Q-learning update rule [29,30], a well studied reinforcement learning algorithm, for estimating the effect of each action across changing dynamics of the systems.…”

Section: Introduction (mentioning)

confidence: 99%

Q-Learning Lagrange Policies for Multi-Action Restless Bandits

Killian, Biswas, Shah et al. 2021

Preprint

Self Cite

Multi-action restless multi-armed bandits (RMABs) are a powerful framework for constrained resource allocation in which 𝑁 independent processes are managed. However, previous work only studies the offline setting where problem dynamics are known. We address this restrictive assumption, designing the first algorithms for learning good policies for Multi-action RMABs online using combinations of Lagrangian relaxation and Q-learning. Our first approach, MAIQL, extends a method for Q-learning the Whittle index in binary-action RMABs to the multi-action setting. We derive a generalized update rule and convergence proof and establish that, under standard assumptions, MAIQL converges to the asymptotically optimal multi-action RMAB policy as 𝑡 → ∞. However, MAIQL relies on learning Q-functions and indexes on two timescales, which leads to slow convergence and requires problem structure to perform well. Thus, we design a second algorithm, LPQL, which learns the well-performing and more general Lagrange policy for multi-action RMABs by learning to minimize the Lagrange bound through a variant of Q-learning. To ensure fast convergence, we take an approximation strategy that enables learning on a single timescale, then give a guarantee relating the approximation's precision to an upper bound of LPQL's return as 𝑡 → ∞. Finally, we show that our approaches always outperform baselines across multiple settings, including one derived from real-world medication adherence data.

The Digital Transformation in Health: How AI Can Improve the Performance of Health Systems

Periáñez, Fernández Del Río, Nazarov et al. 2024

Health Systems & Reform
