“…Since the optimal policy is unknown, we estimate the optimal policy from the online data as π t , to infer the value of interest. As commonly assumed in the current online inference literature (see e.g., Deshpande et al, 2018;Zhang et al, 2020;Chen et al, 2020) and the bandit literature (see e.g., Chu et al, 2011;Abbasi-Yadkori et al, 2011;Bubeck and Cesa-Bianchi, 2012;Zhou, 2015), we consider the conditional mean outcome function takes a linear form, i.e., µ(x, a) = x β(a), where β(•) is a smooth function, which can be estimated via a ridge regression based on H t−1 as…”