“…Research in this direction was initiated by Mandl [12], Jaquette [5], [6], [8], Benito [1], and Sobel [17]. More recent extensions of these results can be found in [11], [9], and [16]. In particular, these references consider the variance (or second moment) of the total expected discounted or average rewards of controlled, discrete-time Markov reward chains, in order to determine the 'best' policy within the class of discounted (or average) optimal policies, namely one achieving a smaller variance (or lower second moment) of the cumulative reward.…”
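The selection criterion described above can be illustrated with a minimal Monte Carlo sketch (not taken from the cited references; the two-policy setup here is hypothetical): two stationary policies yield the same expected total discounted reward, but one produces a strictly smaller variance of the cumulative reward and would be preferred by the variance criterion.

```python
import random

def discounted_return(reward_fn, beta=0.9, horizon=200, rng=None):
    """One truncated sample of the total discounted reward sum_t beta^t * r_t."""
    rng = rng or random.Random()
    return sum((beta ** t) * reward_fn(rng) for t in range(horizon))

def mean_var(reward_fn, n=5000, seed=0):
    """Monte Carlo estimates of the mean and variance of the discounted return."""
    rng = random.Random(seed)
    xs = [discounted_return(reward_fn, rng=rng) for _ in range(n)]
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / n
    return m, v

# Hypothetical policy A: deterministic per-step reward 1.
# Hypothetical policy B: per-step reward 0 or 2 with equal probability
# (same mean reward per step, hence same expected discounted return).
m_a, v_a = mean_var(lambda rng: 1.0)
m_b, v_b = mean_var(lambda rng: rng.choice([0.0, 2.0]))

# Both means are close to 1/(1 - 0.9) = 10, but only policy A has
# (essentially) zero variance, so the variance criterion selects A.
```

Under the variance refinement discussed in the passage, both policies are discounted optimal, and the tie is broken in favor of the policy with the smaller variance of the cumulative reward.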