2016
DOI: 10.1017/s0269964816000279

On the Identification and Mitigation of Weaknesses in the Knowledge Gradient Policy for Multi-Armed Bandits

Abstract: The Knowledge Gradient (KG) policy was originally proposed for offline ranking and selection problems but has recently been adapted for use in online decision-making in general and multi-armed bandit problems (MABs) in particular. We study its use in a class of exponential family MABs and identify weaknesses, including a propensity to take actions which are dominated with respect to both exploitation and exploration. We propose variants of KG which avoid such errors. These new policies include an index heuristic…
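To make the policy under discussion concrete, the following is a minimal sketch of the online Knowledge Gradient rule for Bernoulli arms, in the spirit of Ryzhov, Powell, and Frazier (2012); the Beta-posterior parameterization and the (T − t) horizon weighting are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def kg_bernoulli_choice(alpha, beta, t, T):
    """One-step online Knowledge Gradient arm choice for Bernoulli arms.

    alpha, beta : per-arm Beta posterior parameters.
    t, T        : current round and horizon; (T - t) weights exploration.
    Illustrative sketch only; not the paper's implementation.
    """
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    mu = alpha / (alpha + beta)            # current posterior means
    best = mu.max()
    nu = np.empty_like(mu)                 # KG factor per arm
    for a in range(len(mu)):
        # Posterior mean of arm a after one more success / failure.
        mu_succ = (alpha[a] + 1.0) / (alpha[a] + beta[a] + 1.0)
        mu_fail = alpha[a] / (alpha[a] + beta[a] + 1.0)
        others = np.delete(mu, a).max() if len(mu) > 1 else -np.inf
        # Expected new maximum posterior mean after sampling arm a,
        # minus the current maximum: the one-step "knowledge gradient".
        exp_max = (mu[a] * max(mu_succ, others)
                   + (1.0 - mu[a]) * max(mu_fail, others))
        nu[a] = exp_max - best
    # Online KG: immediate reward plus horizon-weighted learning bonus.
    return int(np.argmax(mu + (T - t) * nu))
```

The paper's critique concerns cases where this one-step lookahead selects arms that are dominated in both mean reward and informational value.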

Cited by 5 publications (4 citation statements) · References 20 publications
“…The issue of continuous state spaces can be solved by using monotonicity properties of ν^GI to bound GI values for any state using the GIs of similar states. From Edwards et al (2017), ν^GI(cΣ, cn, γ) is decreasing in c ∈ ℝ⁺ for any fixed Σ, n, γ, and is increasing in Σ for any fixed c, n, γ. With this result, ν^GI(Σ, n, γ) for a discrete grid of Σ and n can be used to bound and interpolate ν^GI for any interior state.…”
Section: Using a GI-based Policy in Practice
confidence: 93%
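As a concrete reading of the quoted bounding-and-interpolation idea, here is a minimal sketch. All names (bound_gi, gi_table, sigma_grid, n_grid) are hypothetical: it assumes ν^GI has been precomputed on a grid, scales the query state onto a grid column with c = n_grid[j]/n so that monotonicity in c yields a lower bound (c ≥ 1) or an upper bound (c ≤ 1), and uses monotonicity in Σ to interpolate within a column.

```python
import numpy as np

def bound_gi(sigma, n, sigma_grid, n_grid, gi_table):
    """Bracket nu_GI(sigma, n) from a precomputed grid (sketch).

    Uses the two quoted facts: nu_GI(c*sigma, c*n, gamma) is decreasing
    in c > 0, and nu_GI is increasing in sigma. gi_table[i, j] holds
    nu_GI(sigma_grid[i], n_grid[j]); grid boundary handling is elided.
    """
    j_hi = min(int(np.searchsorted(n_grid, n)), len(n_grid) - 1)
    j_lo = max(j_hi - 1, 0)
    # c = n_grid[j] / n maps the query state onto grid column j:
    # c >= 1 (column j_hi) gives a lower bound on nu_GI(sigma, n),
    # c <= 1 (column j_lo) gives an upper bound.
    lower = _interp_column(sigma * n_grid[j_hi] / n, j_hi, sigma_grid, gi_table)
    upper = _interp_column(sigma * n_grid[j_lo] / n, j_lo, sigma_grid, gi_table)
    return lower, upper

def _interp_column(sigma, j, sigma_grid, gi_table):
    # Monotonicity in sigma justifies 1-D interpolation along column j.
    return float(np.interp(sigma, sigma_grid, gi_table[:, j]))
```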
“…The motivating problem is the classical Bayesian MAB. The notation used assumes reward distributions from the exponential family, as described in Edwards et al (2017), but the problem and solution framework are appropriate for general reward distributions.…”
Section: Problem Definition
confidence: 99%
“…Frazier et al (2008, Section 7.2) showed that it is optimal for a search variant (i.e., maximizing the final-period expected reward) of the two-armed bandit problem with continuous responses. A specific variant of this approach for our setting was presented in Powell and Ryzhov (2012, Section 4.7.1) and further studied and improved in Edwards et al (2017).…”
Section: Approximately Optimal - Approximate Dynamic Programming
confidence: 99%
“…In particular, the knowledge gradient policy, originally proposed for off-line ranking and selection problems, has been adapted for use in online decision-making (and the study of multi-armed bandit problems); see Ryzhov, Powell, and Frazier (2012) for an example study of the theoretical foundation and Frazier, Powell, and Simao (2009) for an example calibration study via simulation for the transportation industry. Recently, Edwards, Fearnhead, and Glazebrook (2017) identified weaknesses of the knowledge gradient policy for online decision-making and proposed variants of the policy to overcome them.…”
Section: Related Work
confidence: 99%