“…Since (4) and (5) hold, the sequence of empirical errors $\{[(P_h^{k_i} - P_h) V^*_{h+1}](x, a)\}_{i=1}^{K}$ can be interpreted as a martingale difference sequence (MDS) with respect to the filtration $\{\mathcal{F}\}_{i=1}^{K}$ [37]. Therefore, we can use the Azuma-Hoeffding inequality to give a concentration result [38] for each index in the MDS, i.e., to construct confidence bounds for $Q^*_h$ $\forall h \in \{1, 2,$ .…”
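
For reference, the standard form of the Azuma-Hoeffding inequality invoked in this step can be stated as below. The symbols $X_i$, $c_i$, $t$, and $\delta$ are generic placeholders introduced here for illustration and are not the excerpt's notation; the specific constants the source uses in its confidence bounds are not shown in the excerpt. Suppose $\{X_i\}_{i=1}^{K}$ is a martingale difference sequence with $|X_i| \le c_i$ almost surely for each $i$. Then, for any $t > 0$,

\[
\Pr\left( \left| \sum_{i=1}^{K} X_i \right| \ge t \right) \le 2 \exp\left( - \frac{t^2}{2 \sum_{i=1}^{K} c_i^2} \right).
\]

Setting the right-hand side equal to $\delta \in (0, 1)$ and solving for $t$ gives the equivalent high-probability form: with probability at least $1 - \delta$,

\[
\left| \sum_{i=1}^{K} X_i \right| \le \sqrt{2 \log(2/\delta) \sum_{i=1}^{K} c_i^2}.
\]

A bound of this shape is what typically turns a bounded MDS of empirical errors into a confidence interval.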