Statistical Inference for Online Decision Making: In a Contextual Bandit Setting

Chen, Haoyu; Song, Rui

doi:10.1080/01621459.2020.1770098

Cited by 20 publications

(25 citation statements)

References 26 publications

“…Since the optimal policy is unknown, we estimate the optimal policy from the online data as π_t, to infer the value of interest. As commonly assumed in the current online inference literature (see e.g., Deshpande et al, 2018; Zhang et al, 2020; Chen et al, 2020) and the bandit literature (see e.g., Chu et al, 2011; Abbasi-Yadkori et al, 2011; Bubeck and Cesa-Bianchi, 2012; Zhou, 2015), we consider the conditional mean outcome function takes a linear form, i.e., µ(x, a) = x^⊤β(a), where β(•) is a smooth function, which can be estimated via a ridge regression based on H_{t−1} as…”

Section: Framework (mentioning)

confidence: 78%

“…as the number of pulls for action a, y_{t−1}(a) is the N_{t−1}(a)×1 vector of the outcomes received under action a at time t − 1, and ω is the regularization term. There are two main reasons to choose the ridge estimator instead of the ordinary least square estimator that is considered in Deshpande et al (2018); Zhang et al (2020); Chen et al (2020). First, the ridge estimator is well defined when D_{t−1}(a)^⊤ D_{t−1}(a) is singular, and its bias is negligible when the time step is large.…”

Section: Framework (mentioning)

confidence: 99%
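
The ridge estimator mentioned in the statement above can be illustrated with a minimal, dependency-free sketch for a single action a. It assumes the standard closed form (D^⊤D + ωI)^{-1} D^⊤y; the function names `solve` and `ridge_estimate` and the toy data are hypothetical and are not taken from the citing paper.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def solve(A: Matrix, b: Vector) -> Vector:
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def ridge_estimate(D: Matrix, y: Vector, omega: float) -> Vector:
    """Return (D^T D + omega * I)^{-1} D^T y for one action's data."""
    d = len(D[0])
    gram = [[sum(row[i] * row[j] for row in D) + (omega if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(row[i] * yi for row, yi in zip(D, y)) for i in range(d)]
    return solve(gram, rhs)


# A single pull of the action: D^T D has rank 1, so it is singular.
D_a = [[1.0, 2.0]]
y_a = [3.0]
beta_hat = ridge_estimate(D_a, y_a, omega=1.0)
```

With `omega=0.0` the same call raises `ValueError`, because the Gram matrix is singular after only one pull; any positive `omega` makes the system solvable, which is the first reason the quoted statement gives for preferring ridge over ordinary least squares.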

“…We establish the relationship between p_t and κ_t in the next section and discuss when Assumption 3.1 cannot hold. Assumption 3.2 is a technical condition on bounded contexts such that the mean of the martingale differences will converge to zero (see e.g., Zhang et al, 2020; Chen et al, 2020).…”

Section: Inference For Online Policy Optimization (mentioning)

confidence: 99%

“…Chambaz et al (2017) established the asymptotic normality for the conditional mean outcome under an optimal policy for sequential decision making. Another work in Chen et al (2020) proposed an inverse probability weighted value estimator to infer the value of optimal policy using the EG method. These two works did not discuss how to account for the exploration-and-exploitation trade-off under commonly used bandit algorithms, such as UCB and TS considered in this paper.…”

Section: Introduction (mentioning)

confidence: 99%
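
As a rough illustration of an inverse probability weighted (IPW) value estimator of the kind the statement above refers to, the sketch below averages the rewards of the steps where the action taken matches the recommended action, each divided by the probability of that match. The record layout, the name `ipw_value`, and the toy history are assumptions made for this example; the exact estimator in the cited paper may differ.

```python
from typing import Sequence, Tuple

# Each record: (action taken, action the estimated policy recommends,
# observed reward, probability that the recommended action was taken).
Record = Tuple[int, int, float, float]


def ipw_value(records: Sequence[Record]) -> float:
    """Average of 1{a_t = recommended} * y_t / propensity over all steps."""
    total = 0.0
    for action, recommended, reward, propensity in records:
        if action == recommended:
            total += reward / propensity
    return total / len(records)


history = [
    (1, 1, 2.0, 0.8),
    (0, 1, 5.0, 0.8),   # exploration step: contributes zero
    (1, 1, 1.0, 0.5),
    (0, 0, 3.0, 0.75),
]
value_hat = ipw_value(history)
```

Dividing by the propensity reweights the exploitation steps so that steps where the recommended action was unlikely to be taken count for more.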

Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning

Cai, Song (2021), preprint, self cite

Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine and economics, to provide crucial instruction on the early-stop of the online experiment and timely feedback from the environment. Policy evaluation in online learning thus attracts increasing attention by inferring the mean outcome of the optimal policy (i.e., the value) in real-time. Yet, such a problem is particularly challenging due to the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration and exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration that quantifies the probability of exploring the non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning. The proposed value estimator provides double protection on the consistency and is asymptotically normal with a Wald-type confidence interval provided. Extensive simulations and real data applications are conducted to demonstrate the empirical validity of the proposed DREAM method.

“…They used ε-greedy method with ε being a function of the estimated treatment effect and gave the asymptotic distribution of the mean reward under the optimal decision rule. Chen, Lu, and Song (2020) studied the contextual bandit problem with a linear reward model and also adopted ε-greedy method but their ε is a function of decision steps. They gave the asymptotic distributions for the reward model parameters and the expected reward under the optimal decision rule.…”

Section: Introduction (mentioning)

confidence: 99%
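
An ε-greedy rule whose ε depends on the decision step, as described in the statement above, might look like the following sketch. The schedule min(1, c / t^power), its default constants, and the function names are hypothetical choices for illustration; the statement does not specify the schedule used in the cited work.

```python
import random
from typing import Sequence


def epsilon_schedule(t: int, c: float = 1.0, power: float = 0.5) -> float:
    """Hypothetical decaying schedule: epsilon_t = min(1, c / t**power)."""
    return min(1.0, c / t ** power)


def epsilon_greedy_action(estimated_rewards: Sequence[float], t: int,
                          rng: random.Random) -> int:
    """Explore uniformly with probability epsilon_t, else take the argmax."""
    if rng.random() < epsilon_schedule(t):
        return rng.randrange(len(estimated_rewards))
    return max(range(len(estimated_rewards)),
               key=lambda a: estimated_rewards[a])


# Example call at step 100 with three candidate actions.
rng = random.Random(0)
action = epsilon_greedy_action([0.1, 0.9, 0.3], t=100, rng=rng)
```

Under this rule with K actions, the greedy action is chosen with probability 1 − ε_t + ε_t / K, since uniform exploration can also land on it.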

Statistical Inference for Online Decision Making via Stochastic Gradient Descent

Chen, Song (2020), Journal of the American Statistical Association, self cite

Online decision making aims to learn the optimal decision rule by making personalized decisions and updating the decision rule recursively. It has become easier than before with the help of big data, but new challenges also come along. Since the decision rule should be updated once per step, an offline update which uses all the historical data is inefficient in computation and storage. To this end, we propose a completely online algorithm that can make decisions and update the decision rule online via stochastic gradient descent. It is not only efficient but also supports all kinds of parametric reward models. Focusing on the statistical inference of online decision making, we establish the asymptotic normality of the parameter estimator produced by our algorithm and the online inverse probability weighted value estimator we used to estimate the optimal value. Online plugin estimators for the variance of the parameter and value estimators are also provided and shown to be consistent, so that interval estimation and hypothesis test are possible using our method. The proposed algorithm and theoretical results are tested by simulations and a real data application to news article recommendation.
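
To illustrate the idea of updating a decision rule online from one observation at a time, here is a minimal sketch of a plain stochastic gradient descent step for a linear reward model with squared-error loss. The loss, the constant learning rate, and the name `sgd_update` are assumptions for this example; the algorithm in the paper supports general parametric reward models and may weight or schedule its updates differently.

```python
from typing import List, Sequence


def sgd_update(beta: Sequence[float], x: Sequence[float], y: float,
               learning_rate: float) -> List[float]:
    """One SGD step on the loss 0.5 * (x . beta - y)**2."""
    residual = sum(b * xi for b, xi in zip(beta, x)) - y
    # The gradient of the loss with respect to beta is residual * x.
    return [b - learning_rate * residual * xi for b, xi in zip(beta, x)]


beta = [0.0, 0.0]
stream = [([1.0, 2.0], 3.0), ([1.0, 0.0], 1.0)]
for x_t, y_t in stream:
    # Only the current observation is used; no history is stored.
    beta = sgd_update(beta, x_t, y_t, learning_rate=0.1)
```

Each step costs time and memory proportional to the dimension of `beta`, in contrast to an offline refit over all historical data.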

Policy Learning for Individualized Treatment Regimes on Infinite Time Horizon

Zhou, Li, Zhu (2024), ICSA Book Series in Statistics
