2021
DOI: 10.1109/tai.2021.3074122

Optimal Policy for Bernoulli Bandits: Computation and Algorithm Gauge

Abstract: Bernoulli multi-armed bandits are a reinforcement learning model used to study a variety of choice optimization problems. Often such optimizations concern a finite-time horizon. In principle, statistically optimal policies can be computed via dynamic programming, but doing so is considered infeasible due to prohibitive computational requirements and implementation complexity. Hence, suboptimal algorithms are applied in practice, despite their unknown level of suboptimality. In this article, we demonstrate that…
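The dynamic-programming computation the abstract alludes to can be made concrete. Below is a minimal sketch of a finite-horizon Bayesian dynamic program for a two-armed Bernoulli bandit, assuming uniform Beta(1, 1) priors; the horizon, the function names (value, optimal_arm), and the constants are illustrative assumptions, not taken from the paper:

```python
from functools import lru_cache

# Minimal sketch: Bayesian dynamic programming for a two-armed Bernoulli
# bandit over a finite horizon. The state is the (successes, failures)
# count per arm; with uniform Beta(1, 1) priors, the posterior mean of
# arm i is (s_i + 1) / (s_i + f_i + 2). HORIZON is an illustrative choice.
HORIZON = 20

@lru_cache(maxsize=None)
def value(s1, f1, s2, f2, t):
    """Expected total future reward from this posterior state at step t."""
    if t == HORIZON:
        return 0.0
    p1 = (s1 + 1) / (s1 + f1 + 2)  # posterior mean of arm 1
    p2 = (s2 + 1) / (s2 + f2 + 2)  # posterior mean of arm 2
    # Q-value of pulling an arm: immediate expected reward plus the
    # expected value of the updated posterior (success / failure branch).
    q1 = p1 * (1 + value(s1 + 1, f1, s2, f2, t + 1)) + (1 - p1) * value(s1, f1 + 1, s2, f2, t + 1)
    q2 = p2 * (1 + value(s1, f1, s2 + 1, f2, t + 1)) + (1 - p2) * value(s1, f1, s2, f2 + 1, t + 1)
    return max(q1, q2)

def optimal_arm(s1, f1, s2, f2, t):
    """Optimal action at a state: pick the arm with the larger Q-value."""
    p1 = (s1 + 1) / (s1 + f1 + 2)
    p2 = (s2 + 1) / (s2 + f2 + 2)
    q1 = p1 * (1 + value(s1 + 1, f1, s2, f2, t + 1)) + (1 - p1) * value(s1, f1 + 1, s2, f2, t + 1)
    q2 = p2 * (1 + value(s1, f1, s2 + 1, f2, t + 1)) + (1 - p2) * value(s1, f1, s2, f2 + 1, t + 1)
    return 1 if q1 >= q2 else 2

print(value(0, 0, 0, 0, 0))        # optimal expected reward from the start
print(optimal_arm(0, 0, 0, 0, 0))  # first pull under the optimal policy
```

Even in this toy form the memoized state space grows polynomially with the horizon for two arms and combinatorially with the number of arms, which illustrates why the abstract describes exact computation as commonly considered infeasible.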

Cited by 4 publications (24 citation statements)
References 25 publications

“…• We propose a coded grouping of the arms based on the arm indices and consequently derive the probability of missed detection of change and the probability of false alarm, highlighting the conditions required to limit these probabilities. Based on this, we show that TS-GE achieves sub-linear regret. We compare this bound with the best known bound of $O(\sqrt{K N_C T \log T})$ and discuss the conditions under which the bound of TS-GE outperforms the latter.…”
Section: B. Motivation and Contribution
confidence: 80%
“…Sequential decision-making problems in reinforcement learning (RL) are popularly formulated using the multi-armed bandit (MAB) framework, wherein an agent (or player) selects one or more options (or arms) out of a set of arms at each time slot [1]-[5]. The player performs such an action selection based on the current estimate or belief of the expected reward of the arms.…”
Section: Introduction
confidence: 99%
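As a point of reference for the belief-based action selection described in this quoted passage, here is a minimal Thompson-sampling sketch for Bernoulli arms, one common heuristic of the kind the abstract contrasts with the optimal policy; the true arm means and horizon are illustrative assumptions:

```python
import random

# Minimal Thompson-sampling sketch for Bernoulli arms. The agent keeps a
# Beta posterior per arm and selects arms by sampling from those beliefs.
# TRUE_MEANS and T are illustrative assumptions.
TRUE_MEANS = [0.3, 0.5, 0.7]
T = 1000

successes = [0] * len(TRUE_MEANS)
failures = [0] * len(TRUE_MEANS)

for _ in range(T):
    # Draw one plausible mean per arm from its Beta(s + 1, f + 1)
    # posterior and play the arm whose sample is largest.
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    arm = samples.index(max(samples))
    reward = 1 if random.random() < TRUE_MEANS[arm] else 0
    if reward:
        successes[arm] += 1
    else:
        failures[arm] += 1

print(successes, failures)  # pulls concentrate on the best arm over time
```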
“…There exists a wealth of published work on multi-armed bandits under a variety of assumptions [8], [14], [15], [18]- [20] with some work focusing exclusively on Bernoulli bandits [2]- [4], [6], [16], [21]- [23]. However, as stressed in [14], few articles have been published on any variation of bandits with delayed rewards.…”
Section: A. Related Work and Motivation
confidence: 99%
“…Problem statement: Practically feasible methods to compute the optimal policy in the context of Bernoulli bandits for immediate rewards were recently proposed in [16], where the level of suboptimality of well-known algorithms was gauged. However, to the best of our knowledge, no computation of the optimal policy under delays has ever been published or proposed.…”
Section: A. Related Work and Motivation
confidence: 99%