2015
DOI: 10.1609/aaai.v29i1.9541

High-Confidence Off-Policy Evaluation

Abstract: Many reinforcement learning algorithms use trajectories collected from the execution of one or more policies to propose a new policy. Because execution of a bad policy can be costly or dangerous, techniques for evaluating the performance of the new policy without requiring its execution have been of recent interest in industry. Such off-policy evaluation methods, which estimate the performance of a policy using trajectories collected from the execution of other policies, heretofore have not provided confidence…
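As background for the abstract, here is a minimal sketch of per-trajectory (ordinary) importance sampling, the basic off-policy estimator this line of work builds on. The function and argument names (importance_sampling_estimate, pi_e, pi_b) are illustrative assumptions, not code from the paper.

```python
import numpy as np

def importance_sampling_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Ordinary (per-trajectory) importance sampling estimate of the
    expected return of an evaluation policy pi_e, using trajectories
    generated by a behavior policy pi_b.

    Each trajectory is a list of (state, action, reward) tuples;
    pi_e(a, s) and pi_b(a, s) return action probabilities.
    """
    estimates = []
    for traj in trajectories:
        rho = 1.0        # cumulative importance weight of the trajectory
        ret = 0.0        # discounted return of the trajectory
        discount = 1.0
        for (s, a, r) in traj:
            rho *= pi_e(a, s) / pi_b(a, s)
            ret += discount * r
            discount *= gamma
        estimates.append(rho * ret)
    # The mean of the weighted returns is an unbiased estimate of the
    # evaluation policy's expected return.
    return float(np.mean(estimates))
```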

Cited by 84 publications (82 citation statements: 1 supporting, 81 mentioning, 0 contrasting); references 12 publications.

Selected citation statements:
“…We use Monte Carlo rollouts to estimate V with MB. We also show results for importance sampling (IS) BCa-bootstrap methods from Thomas et al.…”
Section: Results (mentioning, confidence: 99%)
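The BCa-bootstrap approach referenced above can be prototyped with off-the-shelf tools; the sketch below applies scipy.stats.bootstrap with method="BCa" (available in SciPy ≥ 1.7) to a vector of per-trajectory importance-sampled returns. It is a generic illustration under that assumption, not the cited authors' implementation.

```python
import numpy as np
from scipy import stats

def bca_bootstrap_lower_bound(is_returns, confidence=0.95, n_resamples=2000, seed=0):
    """Bias-corrected and accelerated (BCa) bootstrap bound on the mean
    of per-trajectory importance-sampled returns.

    Returns the lower endpoint of a two-sided BCa interval, used here as
    a conservative lower bound on the evaluation policy's expected return.
    """
    res = stats.bootstrap(
        (np.asarray(is_returns, dtype=float),),
        np.mean,
        confidence_level=confidence,
        n_resamples=n_resamples,
        method="BCa",
        random_state=np.random.default_rng(seed),
    )
    return float(res.confidence_interval.low)
```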
“…The method is equally applicable to upper bounds and two-sided intervals. A similar method using weighted importance sampling was proposed by Thomas et al. (2015).…”
Section: Bootstrapping Policy Lower Bounds (mentioning, confidence: 99%)
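For reference, a minimal sketch of the weighted importance sampling (WIS) estimator discussed in this and the following quotes; the input format (per-trajectory cumulative importance weights and returns) is an assumption for illustration.

```python
import numpy as np

def weighted_importance_sampling(is_weights, returns):
    """Weighted importance sampling (WIS) estimate of expected return.

    is_weights[i] is the cumulative importance weight of trajectory i
    (the product over its time steps of pi_e(a|s) / pi_b(a|s)) and
    returns[i] is its observed return. Normalizing by the sum of the
    weights trades the unbiasedness of ordinary IS for lower variance.
    """
    w = np.asarray(is_weights, dtype=float)
    g = np.asarray(returns, dtype=float)
    return float(np.sum(w * g) / np.sum(w))
```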
“…Additional work on safety in MDPs has focused on obtaining high-confidence bounds on the performance of a policy before that policy is deployed (Thomas, Theocharous, and Ghavamzadeh 2015b; Hanna, Stone, and Niekum 2017), as well as methods for high-confidence policy improvement (Thomas, Theocharous, and Ghavamzadeh 2015a). Our work draws inspiration from these previous approaches; however, we provide bounds on policy performance that are applicable when learning from demonstrations, i.e., when the rewards are not observed.…”
Section: Related Work (mentioning, confidence: 99%)
“…The evaluation policy (target distribution) might be a new treatment policy that is both dangerously worse than the behavior policy and quite different from the behavior policy. To determine whether the evaluation policy should be deployed, we might rely on high-confidence guarantees, as has been suggested for similar problems (Thomas et al. 2015a). That is, we might use Hoeffding's inequality to construct a high-confidence lower bound on the expected value of the WIS estimator, and then require this bound to be not far below the performance of the behavior policy.…”
Section: Should One Use US or WIS in Practice? (mentioning, confidence: 99%)
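The Hoeffding-based lower bound described above can be written directly from the inequality; the sketch below assumes the per-trajectory importance-sampled returns are bounded in [0, b], which is the condition Hoeffding's inequality requires.

```python
import numpy as np

def hoeffding_lower_bound(is_returns, b, delta=0.05):
    """1 - delta confidence lower bound on the expected value of the
    importance-sampled returns, assuming every value lies in [0, b].

    Hoeffding's inequality gives, with probability at least 1 - delta:
        true mean >= sample mean - b * sqrt(ln(1/delta) / (2 n))
    """
    x = np.asarray(is_returns, dtype=float)
    n = len(x)
    return float(x.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n)))
```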
“…This means that the lower bound that we compute will be a lower bound on the performance of the decent behavior policy, rather than the true poor performance of the evaluation policy. Moreover, if one uses Student's t-test or a bootstrap method to construct the confidence interval, as has been suggested when using WIS (Thomas et al. 2015b), we might obtain a very tight confidence interval around the performance of the behavior policy. This exemplifies the problem with using WIS for high-risk problems: the bias of the WIS estimator can cause us to often erroneously conclude that dangerous policies are safe to deploy.…”
Section: Should One Use US or WIS in Practice? (mentioning, confidence: 99%)
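For comparison, here is a generic one-sided Student's t lower bound on a sample mean, as one might apply it to (weighted) importance-sampled returns; as the quote warns, such an interval is only approximate and inherits any bias of the underlying estimator.

```python
import numpy as np
from scipy import stats

def t_test_lower_bound(estimates, delta=0.05):
    """One-sided 1 - delta Student's t lower bound on the mean of a
    sample of off-policy return estimates.

    Much tighter than Hoeffding in practice, but only approximate, and
    it inherits whatever bias the underlying estimator (e.g., WIS) has.
    """
    x = np.asarray(estimates, dtype=float)
    n = len(x)
    se = x.std(ddof=1) / np.sqrt(n)              # standard error of the mean
    t_crit = stats.t.ppf(1.0 - delta, df=n - 1)  # one-sided critical value
    return float(x.mean() - t_crit * se)
```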