“…Hence, a policy π is a mapping from states s to the actions a that can be taken from each state, where π(s, a) represents the probability of selecting each possible action, in such a way that the best actions receive the highest selection probability [11]. As [8] explains, the quality of the actions taken by the agent can be evaluated with the "action-value function for policy π", which estimates the expected total return, i.e., the quality of the action taken by the agent while it follows policy π. This function gives the expected total return from the current state s when action a is chosen and policy π is followed from that state onward, as shown in (7).…”
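To make the definition concrete, the following minimal sketch computes the action-value function for a small, acyclic decision process. It is an illustration only and not the formulation of [8] or [11]: the two-state process, the names `TRANSITIONS`, `POLICY`, `GAMMA`, and `action_value`, the uniform policy, and the use of a discount factor are all assumptions introduced here. The value of a state-action pair is taken to be the immediate reward plus the discounted, policy-weighted value of the actions available in the next state.

```python
# Hypothetical two-state decision process, invented purely for illustration.
# TRANSITIONS[(state, action)] = (next_state, reward); "end" is terminal.
TRANSITIONS = {
    ("s0", "left"): ("end", 0.0),
    ("s0", "right"): ("s1", 1.0),
    ("s1", "left"): ("end", 0.0),
    ("s1", "right"): ("end", 2.0),
}

# Stochastic policy: probability of selecting each action in each state.
# A uniform policy is assumed here; it is not taken from the source.
POLICY = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.5, "right": 0.5},
}

GAMMA = 0.5  # assumed discount factor


def action_value(state, action, policy=POLICY, gamma=GAMMA):
    """Expected discounted return of taking `action` in `state`,
    then following `policy` from the next state onward."""
    next_state, reward = TRANSITIONS[(state, action)]
    if next_state == "end":
        # No further reward can be collected after the terminal state.
        return reward
    # Average the next state's action values, weighted by the policy.
    # The recursion terminates because this example process has no cycles.
    next_value = sum(
        prob * action_value(next_state, next_action, policy, gamma)
        for next_action, prob in policy[next_state].items()
    )
    return reward + gamma * next_value


if __name__ == "__main__":
    for (s, a) in TRANSITIONS:
        print(s, a, action_value(s, a))
```

Under these assumptions, choosing "right" in "s0" is worth more than choosing "left", so a policy that shifts probability toward "right" would better reflect the idea that the best actions should receive the highest selection probability. For processes with cycles, an iterative or sampled estimate would be needed in place of this direct recursion.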