2020
DOI: 10.48550/arxiv.2006.05630
Preprint

Distributional Robust Batch Contextual Bandits

Abstract: Policy learning using historical observational data is an important problem that has found widespread applications. Examples include selecting offers, prices, or advertisements to send to customers, as well as selecting which medication to prescribe to a patient. However, the existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment that generated the data, an assumption that is often false or too coarse an approxim…

Cited by 7 publications (29 citation statements)
References 42 publications
“…Interestingly, adversarial training can also be understood as solving DRO with a Wasserstein metric-based uncertainty set [Staib and Jegelka, 2017]. Applications of DRO have been explored in contextual bandits for policy learning [Si et al., 2020, Mo et al., 2020, Faury et al., 2020] and for evaluation [Kato et al., 2020, Jeong and Namkoong, 2020]. Uncertainty sets based on KL-divergence and on the L1, L2, and L∞ norms have been studied in robust RL [Nilim and El Ghaoui, 2005, Iyengar, 2005].…”
Section: Related Work
confidence: 99%
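To make the KL-divergence uncertainty sets in the snippet above concrete: the standard convex-duality identity behind KL-constrained DRO (a well-known result; the radius δ and dual variable α notation here is ours, not taken from the cited papers) is

```latex
\inf_{Q \,:\, D_{\mathrm{KL}}(Q \,\|\, P) \le \delta} \; \mathbb{E}_{Q}[R]
\;=\; \sup_{\alpha > 0} \left\{ -\alpha \log \mathbb{E}_{P}\!\left[e^{-R/\alpha}\right] - \alpha\delta \right\}.
```

The right-hand side collapses the infinite-dimensional worst case over distributions Q into a one-dimensional search over α, which is what makes the robust value computable from logged samples.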
“…JointDRO accounts for shifts in all variables V and is therefore more conservative than the proposed method. Si et al. [2020] recently proposed JointDRO using f-divergences for contextual bandits.…”
Section: Robust OPE in Bandits
confidence: 99%
“…The training environment, of course, is unknown, and only data logged by some policy is available. Focusing on a KL-divergence uncertainty set, Si et al. [2020] tackle this distributionally robust OPE/L (DROPE/L) problem using an approach based on self-normalized inverse propensity scoring (SNIPS).…”
Section: Introduction
confidence: 99%
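As a rough illustration of how a SNIPS-style estimator combines with the KL-DRO dual above, here is a minimal Python sketch. The function name `dro_snips_value` and the radius parameter `delta` are ours for illustration; this is a plausible reconstruction of the estimator's shape under a KL uncertainty set, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

def dro_snips_value(rewards, target_probs, logging_probs, delta):
    """Sketch of a distributionally robust off-policy value estimate.

    Self-normalized importance weights w = pi(a|x) / mu(a|x) re-weight the
    logged rewards, and the KL-DRO dual turns the worst case over a KL ball
    of radius `delta` into a one-dimensional search over alpha > 0.
    """
    r = np.asarray(rewards, dtype=float)
    w = np.asarray(target_probs, dtype=float) / np.asarray(logging_probs, dtype=float)

    def neg_dual(alpha):
        # log of the self-normalized weighted mean of exp(-r / alpha)
        log_mgf = logsumexp(-r / alpha, b=w) - np.log(w.sum())
        # negative of the dual objective: -alpha * log_mgf - alpha * delta
        return alpha * log_mgf + alpha * delta

    res = minimize_scalar(neg_dual, bounds=(1e-6, 1e6), method="bounded")
    return -res.fun
```

Setting `delta` to 0 recovers, up to the numerical tolerance of the bounded search, the ordinary SNIPS point estimate, since the worst-case ball collapses to the re-weighted empirical distribution.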
“…One may consider simply fitting and imputing the propensities using flexible machine learning (ML) algorithms. However, although the SNIPS-based algorithms in Si et al. [2020] are robust to environment shifts, they are not robust to propensity estimation errors. Since the propensity estimates may converge at slow rates, the SNIPS estimators built on them also tend to converge slowly, leading to slow rates in estimation and learning.…”
Section: Introduction
confidence: 99%
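To connect this to the propensity-estimation concern in the snippet above, a toy usage example with synthetic data and a hypothetical logging policy (it reuses `dro_snips_value` from the sketch earlier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))

# Hypothetical logging policy: probability of action 1 is logistic in x[0].
p1 = 1.0 / (1.0 + np.exp(-X[:, 0]))
a = rng.binomial(1, p1)
r = rng.normal(loc=0.5 * a * X[:, 1])

# Propensities fitted by an ML model: any estimation error here enters
# every importance weight w = pi / mu_hat, which is the source of the
# slow-rate concern in the citation above.
mu_hat = LogisticRegression().fit(X, a).predict_proba(X)[np.arange(n), a]

# Target policy: deterministically play action 1.
pi_target = (a == 1).astype(float)

print(dro_snips_value(r, pi_target, mu_hat, delta=0.1))
```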