“…Our formulation postulates that the future environment, characterized by a joint distribution over the context and the rewards of all candidate actions, lies in a Kullback-Leibler neighborhood of the training environment's distribution, thereby allowing a robust policy to be learned from training data without assuming that the future environment matches the past. Although there is a growing literature (see, e.g., [9,19,31,56,6,26,47,21,61,57,39,14,59,64,41,48,66,46,69,1,68,27,28,10,23,22,30]) on distributionally robust optimization (DRO), which shares the same philosophical underpinning of distributional robustness as our work, the existing DRO literature has focused mostly on statistical learning, including supervised learning and feature selection problems, rather than on decision making. To the best of our knowledge, we provide the first distributionally robust formulation for policy evaluation and learning under bandit feedback in a general, non-parametric space.…”
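As a rough illustrative sketch only (the excerpt does not give the authors' notation; the symbols $P_0$, $Q$, $\rho$, $X$, $Y(a)$, $\pi$, and $\Pi$ are introduced here purely for exposition), a KL-neighborhood robust value of a policy $\pi$ might take the form

\[
V_{\mathrm{rob}}(\pi) \;=\; \inf_{Q \,:\, D_{\mathrm{KL}}(Q \,\|\, P_0) \le \rho} \; \mathbb{E}_{(X,\{Y(a)\}) \sim Q}\bigl[\, Y(\pi(X)) \,\bigr],
\]

where $P_0$ denotes the training environment's joint distribution over the context $X$ and the potential rewards $\{Y(a)\}$, and $\rho$ is the radius of the KL neighborhood. Under this sketch, robust policy evaluation estimates $V_{\mathrm{rob}}(\pi)$ for a fixed $\pi$ from logged bandit data, and robust policy learning seeks $\hat{\pi} \in \arg\max_{\pi \in \Pi} V_{\mathrm{rob}}(\pi)$ over a policy class $\Pi$.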