“…
Data: π_t(x, y), t = 1, δ, θ*, θ, γ^k_{s,n}, γ^k_{u,m} and 1 ≤ µ ≤ F ≤ ν
Result: RB, P^k_{s,n} allocation for RUEs
Initialization of learning:
for each (x, y ∈ Y) do
    initialize resource allocation strategy π_t(x, y);
    initialize approximated Q-value ξ_t ψ^T(x, y);
end
while (true) do
    evaluate the state x = x_t
    if (t < ν + 1) then
        select action y according to π_t(x, y) in (20);
        if (C1 to C7 are satisfied) then
            R(x, y) is achieved
        else
            R(x, y) = 0
        end
    else
        update Y_u(x_t) = {y | R_c(x, y) = 1} for x_t
        randomly select Y_F(X_F(x_t, F(t))) out of F joint actions associated with X_F(x_t, F(t))
…”
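The first branch of the loop (t < ν + 1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the policy is assumed to be a plain probability table for the current state rather than the specific form given in (20), constraints C1 to C7 are stood in for by hypothetical predicate functions, and the names `select_action`, `gated_reward`, and `raw_reward` are introduced here purely for illustration.

```python
import random
from typing import Callable, Dict, Hashable, Iterable

State = Hashable
Action = Hashable
Constraint = Callable[[State, Action], bool]


def select_action(policy: Dict[Action, float], rng: random.Random) -> Action:
    """Sample one action from a probability table pi_t(x, .) for a fixed state.

    Assumption: the policy is given as action -> probability; the excerpt's
    equation (20) is not reproduced here.
    """
    actions = list(policy)
    weights = [policy[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def gated_reward(
    x: State,
    y: Action,
    raw_reward: Callable[[State, Action], float],
    constraints: Iterable[Constraint],
) -> float:
    """Return raw_reward(x, y) only if every constraint holds, otherwise 0.

    Mirrors "if (C1 to C7 are satisfied) then R(x, y) is achieved
    else R(x, y) = 0"; the constraint predicates are hypothetical.
    """
    if all(c(x, y) for c in constraints):
        return raw_reward(x, y)
    return 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy actions: (resource block, transmit power); values are made up.
    policy = {("rb0", 0.5): 0.7, ("rb1", 2.0): 0.3}
    power_cap = lambda x, y: y[1] <= 1.0  # hypothetical stand-in for one constraint
    y = select_action(policy, rng)
    print(y, gated_reward("x_t", y, lambda x, a: 1.0, [power_cap]))
```

Keeping the constraint check separate from action selection means an infeasible action is still sampled but earns zero reward, which is the behaviour the pseudocode describes for this phase.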