2021
DOI: 10.1109/tai.2021.3074122

Optimal Policy for Bernoulli Bandits: Computation and Algorithm Gauge

Abstract: Bernoulli multi-armed bandits are a reinforcement learning model used to study a variety of choice optimization problems. Often such optimizations concern a finite-time horizon. In principle, statistically optimal policies can be computed via dynamic programming, but doing so is considered infeasible due to prohibitive computational requirements and implementation complexity. Hence, suboptimal algorithms are applied in practice, despite their unknown level of suboptimality. In this article, we demonstrate that…
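The dynamic-programming computation the abstract alludes to can be made concrete. Below is a minimal sketch of a finite-horizon Bayesian dynamic program for a two-armed Bernoulli bandit, assuming uniform Beta(1, 1) priors; the horizon, the function names (value, optimal_arm), and the constants are illustrative assumptions, not taken from the paper:

```python
from functools import lru_cache

# Minimal sketch: Bayesian dynamic programming for a two-armed Bernoulli
# bandit over a finite horizon. The state is the (successes, failures)
# count per arm; with uniform Beta(1, 1) priors, the posterior mean of
# arm i is (s_i + 1) / (s_i + f_i + 2). HORIZON is an illustrative choice.
HORIZON = 20

@lru_cache(maxsize=None)
def value(s1, f1, s2, f2, t):
    """Expected total future reward from this posterior state at step t."""
    if t == HORIZON:
        return 0.0
    p1 = (s1 + 1) / (s1 + f1 + 2)  # posterior mean of arm 1
    p2 = (s2 + 1) / (s2 + f2 + 2)  # posterior mean of arm 2
    # Q-value of pulling an arm: immediate expected reward plus the
    # expected value of the updated posterior (success / failure branch).
    q1 = p1 * (1 + value(s1 + 1, f1, s2, f2, t + 1)) + (1 - p1) * value(s1, f1 + 1, s2, f2, t + 1)
    q2 = p2 * (1 + value(s1, f1, s2 + 1, f2, t + 1)) + (1 - p2) * value(s1, f1, s2, f2 + 1, t + 1)
    return max(q1, q2)

def optimal_arm(s1, f1, s2, f2, t):
    """Optimal action at a state: pick the arm with the larger Q-value."""
    p1 = (s1 + 1) / (s1 + f1 + 2)
    p2 = (s2 + 1) / (s2 + f2 + 2)
    q1 = p1 * (1 + value(s1 + 1, f1, s2, f2, t + 1)) + (1 - p1) * value(s1, f1 + 1, s2, f2, t + 1)
    q2 = p2 * (1 + value(s1, f1, s2 + 1, f2, t + 1)) + (1 - p2) * value(s1, f1, s2, f2 + 1, t + 1)
    return 1 if q1 >= q2 else 2

print(value(0, 0, 0, 0, 0))        # optimal expected reward from the start
print(optimal_arm(0, 0, 0, 0, 0))  # first pull under the optimal policy
```

Even in this toy form the memoized state space grows polynomially with the horizon for two arms and combinatorially with the number of arms, which illustrates why the abstract describes exact computation as commonly considered infeasible.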

Cited by 4 publications (24 citation statements)
References 25 publications

“…• We propose a coded grouping of the arms based on the arm indices and consequently derive the probability of missed detection of change and the probability of false alarm, highlighting the conditions required to limit these probabilities. Based on this, we show that TS-GE achieves sub-linear regret. We compare this bound with the best known bound of $O(\sqrt{K N_C T \log T})$ and discuss the conditions under which the bound of TS-GE outperforms the latter.…”
Section: B. Motivation and Contribution
confidence: 80%
“…Sequential decision-making problems in reinforcement learning (RL) are popularly formulated using the multi-armed bandit (MAB) framework, wherein an agent (or player) selects one or more options (or arms) out of a set of arms at each time slot [1]-[5]. The player performs such an action selection based on the current estimate or belief of the expected reward of the arms.…”
Section: Introduction
confidence: 99%
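As a point of reference for the belief-based action selection described in this quoted passage, here is a minimal Thompson-sampling sketch for Bernoulli arms, one common heuristic of the kind the abstract contrasts with the optimal policy; the true arm means and horizon are illustrative assumptions:

```python
import random

# Minimal Thompson-sampling sketch for Bernoulli arms. The agent keeps a
# Beta posterior per arm and selects arms by sampling from those beliefs.
# TRUE_MEANS and T are illustrative assumptions.
TRUE_MEANS = [0.3, 0.5, 0.7]
T = 1000

successes = [0] * len(TRUE_MEANS)
failures = [0] * len(TRUE_MEANS)

for _ in range(T):
    # Draw one plausible mean per arm from its Beta(s + 1, f + 1)
    # posterior and play the arm whose sample is largest.
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    arm = samples.index(max(samples))
    reward = 1 if random.random() < TRUE_MEANS[arm] else 0
    if reward:
        successes[arm] += 1
    else:
        failures[arm] += 1

print(successes, failures)  # pulls concentrate on the best arm over time
```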
“…There exists a wealth of published work on multi-armed bandits under a variety of assumptions [8], [14], [15], [18]- [20] with some work focusing exclusively on Bernoulli bandits [2]- [4], [6], [16], [21]- [23]. However, as stressed in [14], few articles have been published on any variation of bandits with delayed rewards.…”
Section: A. Related Work and Motivation
confidence: 99%
“…Problem statement: Practically feasible methods to compute the optimal policy in the context of Bernoulli bandits for immediate rewards were recently proposed in [16], where the level of suboptimality of well-known algorithms was gauged. However, to the best of our knowledge, no computation of the optimal policy under delays has ever been published or proposed.…”
Section: A. Related Work and Motivation
confidence: 99%