Bernoulli multi-armed bandits are a reinforcement learning model used to study a variety of choice optimization problems. Often such optimizations concern a finite time horizon. In principle, statistically optimal policies can be computed via dynamic programming, but doing so is widely considered infeasible due to prohibitive computational requirements and implementation complexity. Hence, suboptimal algorithms are applied in practice, despite their unknown degree of suboptimality. In this article, we demonstrate that optimal policies can be computed efficiently for large time horizons and numbers of arms thanks to a novel memory organization and indexing scheme. We use optimal policies to gauge the suboptimality of several well-known finite- and infinite-time horizon algorithms, including the Whittle and Gittins indices, epsilon-greedy, Thompson sampling, and upper-confidence bound (UCB) algorithms. Our simulation study shows that all but one of the evaluated algorithms perform significantly worse than the optimal policy. The Whittle index offers a nearly optimal strategy for multi-armed Bernoulli bandits even though up to 10% of its decisions differ from those of an optimal policy table. Lastly, we discuss optimizations of known algorithms and derive a novel solution from UCB1-tuned, which outperforms other infinite-time horizon algorithms when dealing with many arms.

Impact Statement: Bernoulli bandits are a reinforcement learning model used to improve decisions with binary outcomes. They have various applications, ranging from headline news selection to clinical trials. Existing bandit algorithms are suboptimal. This article provides the first practical computation method that determines the optimal decisions in Bernoulli bandits, achieving the lowest attainable decision regret (maximum expected benefit). In clinical trials, where an algorithm selects treatments for subsequent patients, our method can substantially reduce the number of unsuccessfully treated patients, by up to 5×. The optimal strategy is also used for new, comprehensive evaluations of well-known suboptimal algorithms. This can significantly improve decision effectiveness in various applications.

Index Terms: Clinical trials, epsilon-greedy, Gittins index (GI), multi-armed Bernoulli bandits, optimal policy (OPT), POKER, Thompson sampling (TS), upper-confidence bound (UCB), Whittle index (WI).
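For reference, the following Python sketch implements two of the baselines the study evaluates, Thompson sampling and UCB1-tuned, on a simulated Bernoulli bandit. The `regret` helper, arm probabilities, and horizon are illustrative choices, not taken from the paper, and the sketch does not reproduce the authors' optimal-policy computation or their UCB1-tuned-derived variant.

```python
import math
import random

class ThompsonSampling:
    """Bernoulli Thompson sampling: keep a Beta(a, b) posterior per
    arm, draw one sample from each posterior, pull the argmax."""
    def __init__(self, k):
        self.a = [1] * k   # prior successes + 1
        self.b = [1] * k   # prior failures + 1

    def select_arm(self):
        draws = [random.betavariate(a, b) for a, b in zip(self.a, self.b)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, reward):
        if reward:
            self.a[arm] += 1
        else:
            self.b[arm] += 1

class UCB1Tuned:
    """UCB1-tuned (Auer et al., 2002): a UCB rule whose exploration
    width is scaled by min(1/4, V_j), an upper bound on the arm's
    variance; for Bernoulli rewards the empirical variance is p(1-p)."""
    def __init__(self, k):
        self.n = [0] * k   # pulls per arm
        self.s = [0] * k   # successes per arm
        self.t = 0         # total pulls so far

    def select_arm(self):
        for i, n in enumerate(self.n):
            if n == 0:     # pull every arm once before indexing
                return i
        log_t = math.log(self.t)
        def index(i):
            p = self.s[i] / self.n[i]
            v = p * (1 - p) + math.sqrt(2 * log_t / self.n[i])
            return p + math.sqrt(log_t / self.n[i] * min(0.25, v))
        return max(range(len(self.n)), key=index)

    def update(self, arm, reward):
        self.t += 1
        self.n[arm] += 1
        self.s[arm] += reward

def regret(policy, probs, horizon):
    """Cumulative expected regret of `policy` over `horizon` pulls."""
    best, total = max(probs), 0.0
    for _ in range(horizon):
        arm = policy.select_arm()
        policy.update(arm, random.random() < probs[arm])
        total += best - probs[arm]
    return total

random.seed(0)
probs = [0.3, 0.5, 0.7]   # illustrative arm success probabilities
for cls in (ThompsonSampling, UCB1Tuned):
    runs = [regret(cls(len(probs)), probs, 2000) for _ in range(20)]
    print(cls.__name__, round(sum(runs) / len(runs), 1))
```

Both policies are the standard textbook formulations; measuring their gap to the optimal policy is exactly the kind of comparison the article carries out at scale.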
Bernoulli multi-armed bandits are a reinforcement learning model used to optimize sequences of decisions with binary outcomes. Well-known bandit algorithms, including the optimal policy, assume that the outcomes of previous decisions are known before a decision is made. This assumption is often not satisfied in real-life scenarios. As demonstrated in this article, if decision outcomes are affected by delays, the performance of existing algorithms can be severely degraded. We present the first practically applicable method to compute statistically optimal decisions in the presence of outcome delays. Our method has a predictive component, abstracted out into a meta-algorithm, predictive algorithm reducing delay impact (PARDI), which significantly reduces the impact of delays on commonly used algorithms. We demonstrate empirically that the PARDI-enhanced Whittle index is nearly optimal for a wide range of Bernoulli bandit parameters and delays. In a wide spectrum of experiments, it performed better than any other suboptimal algorithm, e.g., UCB1-tuned and Thompson sampling. The PARDI-enhanced Whittle index can be used when the computational requirements of the optimal policy are too high.

Impact Statement: Bernoulli multi-armed bandit algorithms are used to optimize sequential binary decisions. Oftentimes, decisions must be made without knowing the results of some previous decisions, e.g., in clinical trials, where finding out treatment outcomes takes time. Well-known bandit algorithms are ill-equipped to deal with still-unknown (delayed) decision results, which may translate into significant losses, e.g., in the number of unsuccessfully treated patients. We present the first method of determining the optimal strategy for this type of situation, along with a meta-algorithm, PARDI, that drastically improves the quality of decisions made by well-known algorithms, lowering regret by up to 3×. This is achieved by a 6× reduction in the excess regret caused by delay. By addressing delays, this work can improve the quality of decisions in various applications and opens new applications of Bernoulli bandits.
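The abstract describes PARDI only at a high level, so the following Python sketch is a hypothetical illustration, not the paper's method. It simulates a Bernoulli bandit whose outcomes arrive `delay` steps after each pull and compares a naive Thompson sampler, which simply waits for outcomes, against one that temporarily imputes each pending outcome with a draw from the arm's posterior mean before deciding. The `run` helper, arm probabilities, horizon, and delay are all illustrative choices.

```python
import random
from collections import deque

class ThompsonSampling:
    """Bernoulli Thompson sampling with Beta(a, b) posteriors."""
    def __init__(self, k):
        self.a = [1] * k
        self.b = [1] * k

    def select_arm(self):
        draws = [random.betavariate(a, b) for a, b in zip(self.a, self.b)]
        return max(range(len(draws)), key=draws.__getitem__)

    def mean(self, arm):
        return self.a[arm] / (self.a[arm] + self.b[arm])

    def update(self, arm, reward, sign=1):
        if reward:
            self.a[arm] += sign
        else:
            self.b[arm] += sign

def run(probs, horizon, delay, predictive):
    """Cumulative expected regret when every outcome arrives
    `delay` steps late; `predictive` toggles the imputation step."""
    bandit = ThompsonSampling(len(probs))
    pending = deque()                         # (arrival_time, arm, reward)
    best, regret = max(probs), 0.0
    for t in range(horizon):
        while pending and pending[0][0] <= t:  # deliver due outcomes
            _, arm, r = pending.popleft()
            bandit.update(arm, r)
        imputed = []
        if predictive:
            # Hypothetical predictive step (NOT the paper's PARDI):
            # temporarily fill in each still-pending outcome with a draw
            # from the arm's posterior mean, decide, then roll back.
            for _, arm, _ in pending:
                guess = random.random() < bandit.mean(arm)
                bandit.update(arm, guess)
                imputed.append((arm, guess))
        arm = bandit.select_arm()
        for a, g in reversed(imputed):
            bandit.update(a, g, sign=-1)       # undo the imputations
        reward = random.random() < probs[arm]
        pending.append((t + delay, arm, reward))
        regret += best - probs[arm]
    return regret

random.seed(1)
for predictive in (False, True):
    avg = sum(run([0.4, 0.5, 0.6], 2000, 100, predictive)
              for _ in range(20)) / 20
    print("predictive" if predictive else "naive     ", round(avg, 1))
```

How much such imputation helps depends on the base algorithm and the delay; the gains the abstract reports (regret lowered by up to 3×) refer to PARDI itself, which this sketch does not reproduce.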