2019
DOI: 10.1609/aaai.v33i01.33014634

Efficient Counterfactual Learning from Bandit Feedback

Abstract: What is the most statistically efficient way to do off-policy optimization with batch data from bandit feedback? For log data generated by contextual bandit algorithms, we consider offline estimators for the expected reward from a counterfactual policy. Our estimators are shown to have lowest variance in a wide class of estimators, achieving variance reduction relative to standard estimators. We then apply our estimators to improve advertisement design by a major advertisement company. Consistent with the theo…
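As background for the abstract's setup, the sketch below shows the standard inverse-propensity-weighting (IPW) estimator of a counterfactual policy's expected reward from logged bandit feedback. It is a minimal illustration under assumed names (ipw_value, pi_b, pi_e, the synthetic log), not the paper's code or its proposed estimator.

```python
import numpy as np

def ipw_value(rewards, behavior_probs, target_probs):
    """Inverse-propensity-weighted estimate of the target policy's value.

    rewards        : (n,) observed rewards for the logged actions
    behavior_probs : (n,) logging-policy probabilities of those actions
    target_probs   : (n,) target-policy probabilities of the same actions
    """
    weights = target_probs / behavior_probs      # importance weights
    return np.mean(weights * rewards)

# Toy usage on a synthetic log with three actions and a uniform logging policy.
rng = np.random.default_rng(0)
n, k = 10_000, 3
mu = np.array([0.2, 0.5, 0.8])                   # mean reward of each action
pi_b = np.full(k, 1.0 / k)                       # uniform logging policy
pi_e = np.array([0.1, 0.2, 0.7])                 # counterfactual target policy

A = rng.integers(0, k, size=n)                   # logged actions
R = rng.binomial(1, mu[A]).astype(float)         # logged binary rewards

# True value of pi_e is 0.1*0.2 + 0.2*0.5 + 0.7*0.8 = 0.68
print(ipw_value(R, pi_b[A], pi_e[A]))
```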

Cited by 17 publications (15 citation statements) | References 13 publications
“…However, in this work, we show that importance sampling with an estimated behavior policy lowers the variance of expectation evaluation in both on- and off-policy settings. Our work complements existing approaches in the causal inference (Rosenbaum 1987; Hirano et al. 2003) and bandit (Li et al. 2015; Narita et al. 2019) literatures that have used importance sampling with an estimated behavior policy as a variance reduction strategy. We extend this general approach to sequential decision making tasks.…”
Section: Introduction (mentioning)
confidence: 89%
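The passage above describes importance sampling with an estimated, rather than logged, behavior policy as a variance-reduction device. Below is a hedged sketch of that idea for contextual logs; the propensity model (scikit-learn multinomial logistic regression) and all names (estimated_propensity_is, target_probs) are illustrative assumptions here, not the cited papers' implementations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimated_propensity_is(contexts, actions, rewards, target_probs):
    """Importance-sampling value estimate that plugs in behavior-policy
    propensities fitted from the logged (context, action) pairs themselves.

    contexts     : (n, d) context feature matrix
    actions      : (n,) logged action indices, assumed to be 0..k-1 with
                   every action appearing at least once (so predict_proba
                   columns align with action indices)
    rewards      : (n,) observed rewards
    target_probs : (n, k) target-policy action probabilities per context
    """
    # Fit a propensity model to the log and read off P_hat(a_i | x_i).
    model = LogisticRegression(max_iter=1000).fit(contexts, actions)
    p_hat = model.predict_proba(contexts)
    idx = np.arange(len(actions))
    b_hat = p_hat[idx, actions]                   # estimated propensity of each logged action
    weights = target_probs[idx, actions] / b_hat  # importance weights with estimated denominator
    return np.mean(weights * rewards)
```

The design choice is simply to reuse the logged data to fit the propensities that enter the importance weights, even when the true logging probabilities are available.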
“…For contextual bandits, Narita et al. (2019) prove that importance sampling with an estimated behavior policy minimizes asymptotic variance among all asymptotically normal estimators (including ordinary importance sampling). They also provide a large-scale study of policy evaluation with the empirical behavior policy on an ad-placement task.”
Section: Importance Sampling With An Estimated Behavior Policy (mentioning)
confidence: 98%
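The variance claim quoted above can be illustrated with a toy context-free simulation in which the estimated propensities are just empirical action frequencies. This is an assumption-laden illustration of the general phenomenon, not the paper's ad-design experiment or its asymptotic analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n, trials = 3, 2_000, 500
pi_b = np.full(k, 1.0 / k)            # true (uniform) logging policy
pi_e = np.array([0.1, 0.2, 0.7])      # target policy to evaluate
mu = np.array([0.2, 0.5, 0.8])        # mean reward of each arm

est_true, est_hat = [], []
for _ in range(trials):
    A = rng.integers(0, k, size=n)
    R = rng.binomial(1, mu[A]).astype(float)
    pi_b_hat = np.bincount(A, minlength=k) / n              # estimated behavior policy
    est_true.append(np.mean(pi_e[A] / pi_b[A] * R))         # IS with true propensities
    est_hat.append(np.mean(pi_e[A] / pi_b_hat[A] * R))      # IS with estimated propensities

print("std, true propensities     :", np.std(est_true))
print("std, estimated propensities:", np.std(est_hat))      # typically smaller
```

With a context-free log, the estimated-propensity estimator reduces to a per-arm sample mean weighted by the target policy, which is why its spread is typically smaller than that of the true-propensity version.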
“…Neither of these approaches extends to general function classes, and they are not applicable to the extreme setting. In the context of off-policy learning from logged data, several works address the top-k arm selection problem in the setting of slate recommendations (Swaminathan et al., 2017; Narita et al., 2019). We will now review the combinatorial action space literature in multi-armed bandit (MAB) problems.…”
Section: Related Work (mentioning)
confidence: 99%
“…For example, Hanna et al. [2021] show how importance sampling with an estimated behavior policy can lower sampling error and lead to more accurate policy evaluation. Similar methods have also been studied for policy evaluation in multi-armed bandits (Li et al. [2015], Narita et al. [2019]), temporal-difference learning (Pavse et al. [2020]), and policy-gradient RL (Hanna et al. [2021]). All of these prior works assume that data is available a priori and ignore the question of how to collect data if it is unavailable.…”
Section: Related Work (mentioning)
confidence: 99%