“…We consider an infinite-horizon discounted constrained Markov decision process (CMDP) [79, 6, 4], specified by the tuple $(\mathcal{S}, \mathcal{A}, P, r, u, b, \gamma, \rho)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $P$ is the transition kernel giving the probability $P(s' \mid s, a)$ of moving from state $s$ to next state $s'$ under action $a \in \mathcal{A}$, $r, u : \mathcal{S} \times \mathcal{A} \to [0, 1]$ are the reward and utility functions, $b$ is the constraint threshold, $\gamma \in [0, 1)$ is the discount factor, and $\rho$ is the initial state distribution. A stationary stochastic policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ maps the current state to a probability distribution $\Delta(\mathcal{A})$ over the action space $\mathcal{A}$, i.e., $a_t \sim \pi(\cdot \mid s_t)$ at time $t$. Let $\Pi$ denote the set of all stationary stochastic policies.…”
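To make concrete how the reward $r$, utility $u$, threshold $b$, discount $\gamma$, and initial distribution $\rho$ fit together, one standard way to state the CMDP objective is sketched below; the value-function notation $V_r^\pi$, $V_u^\pi$ is not given in the excerpt and is assumed here for illustration:
\[
V_r^\pi(\rho) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \;\middle|\; s_0 \sim \rho,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)\right],
\]
with $V_u^\pi(\rho)$ defined analogously using the utility $u$ in place of $r$, and the constrained problem typically reads
\[
\max_{\pi \in \Pi}\; V_r^\pi(\rho) \quad \text{subject to} \quad V_u^\pi(\rho) \;\ge\; b .
\]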