“…R-max# starts by initializing the counters n(s, a) = n(s, a, s') = r(s, a) = 0, the rewards to r_max, the transitions to a fictitious state s_0 (as in R-max), and the set of known pairs K = ∅ (lines 1-4). Then, in each round, the algorithm checks, for each state-action pair (s, a) that is labeled as known (∈ K), how many rounds have passed since the last update (lines 8-9). If this number is greater than the threshold τ, the reward for that pair is set to r_max, the counters n(s, a) and n(s, a, s') and the transition function T(s, a, s') are reset, and a new policy is computed (lines 10-14). Then, the algorithm behaves as R-max (lines 15-24).…”
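
The description above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pseudocode: the class name `RMaxSharp`, the visit threshold `m` for marking a pair as known, the discount `gamma`, the value-iteration planner, and the choice to treat "last update" as the round in which a pair became known are all assumptions made for the example. Removing a reset pair from K is also an assumption; the excerpt only says that its reward, counters, and transitions are reset.

```python
from collections import defaultdict

S0 = "s0"  # fictitious absorbing state (the name is an assumption)


class RMaxSharp:
    def __init__(self, states, actions, r_max, m, tau, gamma=0.9):
        self.states, self.actions = list(states), list(actions)
        self.r_max, self.m, self.tau, self.gamma = r_max, m, tau, gamma
        self.n = defaultdict(int)         # n(s, a)
        self.n_next = defaultdict(int)    # n(s, a, s')
        self.r_sum = defaultdict(float)   # r(s, a), accumulated reward
        self.known = {}                   # K: (s, a) -> round of last update
        self.t = 0                        # current round
        self.plan()

    def model(self, s, a):
        """Return (reward, {s': prob}); optimistic for pairs outside K."""
        if s == S0 or (s, a) not in self.known:
            return self.r_max, {S0: 1.0}
        n = self.n[s, a]
        probs = {s2: self.n_next[s, a, s2] / n
                 for s2 in self.states if self.n_next[s, a, s2] > 0}
        return self.r_sum[s, a] / n, probs

    def q(self, s, a, v):
        r, probs = self.model(s, a)
        return r + self.gamma * sum(p * v[s2] for s2, p in probs.items())

    def plan(self, sweeps=200):
        """Compute a greedy policy by value iteration on the current model."""
        all_states = self.states + [S0]
        v = {s: 0.0 for s in all_states}
        for _ in range(sweeps):
            v = {s: max(self.q(s, a, v) for a in self.actions)
                 for s in all_states}
        self.policy = {s: max(self.actions, key=lambda a: self.q(s, a, v))
                       for s in self.states}

    def step(self, s, a, reward, s_next):
        """Process one round and return the action to take in s_next."""
        self.t += 1
        replan = False
        # R-max# part: reset known pairs that are older than tau rounds.
        stale = [p for p, t0 in self.known.items() if self.t - t0 > self.tau]
        for pair in stale:
            del self.known[pair]          # pair is optimistic (r_max, s0) again
            self.n[pair] = 0
            self.r_sum[pair] = 0.0
            for s2 in self.states:
                self.n_next[pair + (s2,)] = 0
            replan = True
        # R-max part: count the observation; pair becomes known after m visits.
        if (s, a) not in self.known:
            self.n[s, a] += 1
            self.n_next[s, a, s_next] += 1
            self.r_sum[s, a] += reward
            if self.n[s, a] >= self.m:
                self.known[s, a] = self.t
                replan = True
        if replan:
            self.plan()
        return self.policy[s_next]
```

In this sketch, a pair that has been reset goes back to the optimistic model, so the greedy policy is drawn to revisit it; this is how the periodic reset forces renewed exploration of pairs whose learned model may have become outdated.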