2020
DOI: 10.1080/01621459.2020.1770098

Statistical Inference for Online Decision Making: In a Contextual Bandit Setting

Abstract: The online decision-making problem requires us to make a sequence of decisions based on incremental information. Common solutions often need to learn a reward model for the different actions given the contextual information and then maximize the long-term reward. It is meaningful to know whether the posited model is reasonable and how the model performs in the asymptotic sense. We study this problem under the contextual bandit framework with a linear reward model. The ε-greedy policy is adopted to address the c…
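To make the setup concrete, the following is a minimal sketch of an ε-greedy contextual bandit with a linear reward model and per-action ridge estimates. It is not the authors' implementation: the exploration schedule eps_t, the ridge penalty omega, and the reward_fn interface are illustrative assumptions.

import numpy as np

def epsilon_greedy_bandit(contexts, reward_fn, n_actions, omega=1.0, seed=0):
    """Run an epsilon-greedy contextual bandit with a linear reward model.

    contexts : (T, d) array of contextual covariates x_t
    reward_fn: callable (t, action) -> observed reward y_t
    The per-action coefficients beta(a) are estimated by ridge regression,
    and eps_t decays with the decision step t (illustrative schedule).
    """
    rng = np.random.default_rng(seed)
    T, d = contexts.shape
    # Sufficient statistics for ridge regression, one pair per action:
    # A[a] accumulates D(a)^T D(a) + omega*I, b[a] accumulates D(a)^T y(a).
    A = [omega * np.eye(d) for _ in range(n_actions)]
    b = [np.zeros(d) for _ in range(n_actions)]
    rewards = []
    for t in range(T):
        x = contexts[t]
        eps_t = min(1.0, 1.0 / np.sqrt(t + 1))  # assumed decaying exploration rate
        if rng.random() < eps_t:
            a = int(rng.integers(n_actions))     # explore uniformly at random
        else:
            beta_hat = [np.linalg.solve(A[k], b[k]) for k in range(n_actions)]
            a = int(np.argmax([x @ bh for bh in beta_hat]))  # exploit: argmax_a x^T beta_hat(a)
        y = reward_fn(t, a)
        A[a] += np.outer(x, x)
        b[a] += y * x
        rewards.append(y)
    return np.array(rewards)

With simulated contexts and a reward function returning x⊤β(a) plus noise, the returned reward sequence can be compared against an oracle policy to gauge regret.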

Cited by 20 publications (25 citation statements). References 26 publications.
“…Since the optimal policy is unknown, we estimate the optimal policy from the online data as π_t, to infer the value of interest. As commonly assumed in the current online inference literature (see e.g., Deshpande et al, 2018; Zhang et al, 2020; Chen et al, 2020) and the bandit literature (see e.g., Chu et al, 2011; Abbasi-Yadkori et al, 2011; Bubeck and Cesa-Bianchi, 2012; Zhou, 2015), we consider the conditional mean outcome function takes a linear form, i.e., µ(x, a) = x⊤β(a), where β(·) is a smooth function, which can be estimated via a ridge regression based on H_{t−1} as…”
Section: Framework
confidence: 78%
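As a sketch of the ridge step described in this excerpt, the closed-form estimate β̂(a) = (D_{t−1}(a)⊤D_{t−1}(a) + ωI)⁻¹ D_{t−1}(a)⊤ y_{t−1}(a) can be computed directly; the function name and the simulated inputs below are hypothetical, only the formula follows the quoted notation.

import numpy as np

def ridge_beta(D_a, y_a, omega=1.0):
    # Closed-form ridge estimate (D^T D + omega I)^{-1} D^T y, where D_a stacks
    # the contexts of the N_{t-1}(a) pulls of action a and y_a holds the
    # corresponding outcomes.
    d = D_a.shape[1]
    return np.linalg.solve(D_a.T @ D_a + omega * np.eye(d), D_a.T @ y_a)

# Example with simulated data: 10 pulls of action a with 3-dimensional contexts.
rng = np.random.default_rng(0)
D_a = rng.standard_normal((10, 3))
y_a = D_a @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.standard_normal(10)
print(ridge_beta(D_a, y_a))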
“…as the number of pulls for action a, y_{t−1}(a) is the N_{t−1}(a) × 1 vector of the outcomes received under action a at time t − 1, and ω is the regularization term. There are two main reasons to choose the ridge estimator instead of the ordinary least squares estimator that is considered in Deshpande et al (2018); Zhang et al (2020); Chen et al (2020). First, the ridge estimator is well defined when D_{t−1}(a)⊤D_{t−1}(a) is singular, and its bias is negligible when the time step is large.…”
Section: Framework
confidence: 99%
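A small numeric check of the first point in the excerpt, that the ridge estimate remains well defined when D_{t−1}(a)⊤D_{t−1}(a) is singular whereas the OLS estimate does not exist; the data are a deliberately degenerate toy example.

import numpy as np

# Two pulls of the same context: D^T D is rank-1 (singular) in d = 2 dimensions,
# so the OLS estimate (D^T D)^{-1} D^T y does not exist, but the ridge estimate does.
D = np.array([[1.0, 2.0],
              [1.0, 2.0]])
y = np.array([0.9, 1.1])

print(np.linalg.matrix_rank(D.T @ D))    # 1 < 2, i.e., singular
# np.linalg.solve(D.T @ D, D.T @ y)      # would raise LinAlgError: Singular matrix
omega = 1.0
beta_ridge = np.linalg.solve(D.T @ D + omega * np.eye(2), D.T @ y)
print(beta_ridge)                         # well defined despite singular D^T D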
“…They used the ε-greedy method with ε being a function of the estimated treatment effect and gave the asymptotic distribution of the mean reward under the optimal decision rule. Chen, Lu, and Song (2020) studied the contextual bandit problem with a linear reward model and also adopted the ε-greedy method, but their ε is a function of the decision steps. They gave the asymptotic distributions for the reward model parameters and the expected reward under the optimal decision rule.…”
Section: Introduction
confidence: 99%
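To illustrate the distinction drawn here, a step-dependent exploration rate can be written as a simple decaying function of the decision step t; the particular form and constants below are assumptions for illustration and need not match the schedule used by Chen, Lu, and Song (2020).

def epsilon_step(t, alpha=0.3, c=1.0):
    # Exploration rate that depends only on the decision step t
    # (illustrative schedule; the paper's exact form may differ).
    return min(1.0, c / (t ** alpha))

print([round(epsilon_step(t), 3) for t in (1, 10, 100, 1000)])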