2012
DOI: 10.1287/opre.1110.0999

The Knowledge Gradient Algorithm for a General Class of Online Learning Problems

Abstract: We derive a one-period look-ahead policy for finite- and infinite-horizon online optimal learning problems with Gaussian rewards. Our approach is able to handle the case where our prior beliefs about the rewards are correlated, which is not handled by traditional multiarmed bandit methods. Experiments show that our knowledge gradient (KG) policy performs competitively against the best-known approximation to the optimal policy in the classic bandit problem, and it outperforms many learning policies in the correlated case.
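To make the one-period look-ahead concrete, the following is a minimal sketch of an online KG decision rule under simplifying assumptions: independent Gaussian beliefs, known measurement noise variance, and a finite horizon N. The names kg_factor and online_kg_choice are illustrative, not the authors' implementation, and the exact horizon weighting may differ slightly from the paper, which also covers correlated beliefs and a discounted infinite-horizon variant.

import numpy as np
from scipy.stats import norm

def kg_factor(mu, sigma2, lam):
    # Offline knowledge-gradient factor for each alternative under
    # independent Gaussian beliefs N(mu[x], sigma2[x]) and known
    # measurement noise variance lam.
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    # Standard deviation of the change in the posterior mean after one measurement.
    sigma_tilde = sigma2 / np.sqrt(sigma2 + lam)
    # Distance from each alternative's mean to the best competing mean.
    best_other = np.array([np.max(np.delete(mu, x)) for x in range(len(mu))])
    z = -np.abs(mu - best_other) / sigma_tilde
    return sigma_tilde * (z * norm.cdf(z) + norm.pdf(z))

def online_kg_choice(mu, sigma2, lam, n, N):
    # One-period look-ahead for the finite-horizon online problem:
    # trade off the immediate expected reward mu against the value of the
    # information gained, weighted by the number of remaining periods.
    # (An infinite-horizon discounted variant would weight by gamma/(1-gamma).)
    return int(np.argmax(np.asarray(mu) + (N - n) * kg_factor(mu, sigma2, lam)))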

Cited by 123 publications (89 citation statements)
References 38 publications

“…Certain beliefs are assumed to exist among the alternative system designs, while the data to be blended with the prior density function comes from the simulation itself. In particular, the knowledge gradient policy - originally proposed for off-line ranking and selection problems - has been adapted to be used for online decision-making (and the study of multi-armed bandit problems); see Ryzhov, Powell, and Frazier (2012) for an example study of the theoretical foundation and Frazier, Powell, and Simao (2009) for an example calibration study via simulation for the transportation industry. Recently, Edwards, Fearnhead, and Glazebrook (2017) identify weaknesses of the knowledge gradient policy for online decision making and propose variants of the policy to overcome them.…”
Section: Related Work
confidence: 99%
“…In the optimal learning literature, Gaussian assumptions are standard due to advantages such as the ability to concisely model correlations between estimated values [19,20,21,22]. More recently, however, numerous applications have emerged where observations are clearly non-Gaussian.…”
Section: Goal Of This Dissertation
confidence: 99%
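As background for why the Gaussian assumption is so convenient here: with a multivariate normal prior over all alternatives and Gaussian observation noise, a single scalar observation updates every posterior mean and covariance entry in closed form. A minimal sketch, assuming known noise variance lam; the name correlated_update is illustrative and not taken from the cited works.

import numpy as np

def correlated_update(mu, Sigma, x, y, lam):
    # Conjugate update of a multivariate normal belief N(mu, Sigma) after
    # observing reward y from alternative x with Gaussian noise variance lam.
    # One observation revises every mean and covariance entry, which is how
    # correlated priors share information across alternatives never sampled.
    Sigma_x = Sigma[:, x]            # covariance of all alternatives with the measured one
    denom = lam + Sigma[x, x]
    mu_new = mu + (y - mu[x]) / denom * Sigma_x
    Sigma_new = Sigma - np.outer(Sigma_x, Sigma_x) / denom
    return mu_new, Sigma_new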
“…In a multi-armed bandit problem, the KG policy [21] considers all the alternatives together and calculates the expected improvement…”
Section: Difficulty With Non-Gaussian Rewards
confidence: 99%
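The "expected improvement" referenced in this excerpt is the knowledge-gradient factor of the measured alternative. In the notation standard in this literature (a sketch of the usual definition, not a quotation from the cited paper):

\nu^{\mathrm{KG},n}_x = \mathbb{E}\!\left[\max_y \mu^{n+1}_y - \max_y \mu^{n}_y \,\middle|\, S^n,\ x^n = x\right],

which, for independent Gaussian beliefs with known measurement noise, reduces to the closed form

\nu^{\mathrm{KG},n}_x = \tilde{\sigma}^{n}_x\, f\!\left(-\frac{\lvert \mu^{n}_x - \max_{y \neq x} \mu^{n}_y \rvert}{\tilde{\sigma}^{n}_x}\right), \qquad f(z) = z\,\Phi(z) + \varphi(z),

where \tilde{\sigma}^{n}_x is the standard deviation of the change in the posterior mean of alternative x after one more measurement, and \Phi, \varphi are the standard normal CDF and PDF.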
“…It has since been studied in greater depth as the knowledge gradient by Frazier and Powell [9] for problems where the measurement noise is known, and by Chick and Branke [11] under the name of LL(1) (linear loss with batch size 1) for the case where the measurement noise is unknown. The idea was recently applied to on-line problems [24] for multiarmed bandit problems with both independent and correlated beliefs.…”
Section: The Knowledge Gradient Policy
confidence: 99%
“…This idea is developed with much greater rigor in Ryzhov et al [24], which also reports on comparisons between the OLKG policy and other policies. It was found that the KG policy actually outperforms Gittins indices even when applied to multiarmed bandit problems (where Gittins is known to be optimal) when we use the approximation in Equation (8).…”
Section: On-line Learning
confidence: 99%