“…Raw features in human-machine conversations such as words with confidence scores can be given as input to a reinforcement learning agent to induce dialogue policies from interaction with the environment, where situations (words) are mapped to actions (dialogue acts) by maximizing a long-term reward signal [2]. An RL agent is typically characterized by: (i) a finite set of states S = {s₁, ..., sₙ}; (ii) a finite set of actions A = {a₁, ..., aₘ}; (iii) a state transition function T(s, a, s′) that specifies the next state s′ given the current state s and action a; (iv) a reward function R(s, a, s′) that specifies the reward given to the agent for choosing action a when the environment makes a transition from state s to state s′; and (v) a policy π : S → A that defines a mapping from states to actions.…”
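To make the (S, A, T, R, π) tuple concrete, the sketch below shows a minimal tabular Q-learning agent in Python. This is an illustrative sketch, not the system of [2]: the class name, the dialogue acts in the usage example, and the hyperparameter values are all assumptions chosen for clarity. The Q-table plays the role of the learned value function from which the policy π is derived, and the epsilon-greedy rule handles exploration.

```python
import random
from collections import defaultdict

class TabularQAgent:
    """Minimal tabular Q-learning agent over a finite state set S and
    action set A, as in the (S, A, T, R, pi) characterization above.
    Illustrative sketch only; not the agent described in [2]."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.actions = actions        # finite action set A (dialogue acts)
        self.alpha = alpha            # learning rate (assumed value)
        self.gamma = gamma            # discount factor for long-term reward
        self.epsilon = epsilon        # exploration probability (assumed value)
        self.q = defaultdict(float)   # Q(s, a) table, initialized to 0.0

    def policy(self, state):
        """pi : S -> A, realized here as epsilon-greedy over Q-values."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)   # explore
        return max(self.actions, key=lambda a: self.q[(state, a)])  # exploit

    def update(self, state, action, reward, next_state):
        """One-step Q-learning update from an observed transition
        (s, a, r, s'), where s' is drawn by the environment's T and
        r by its reward function R."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])


# Hypothetical usage: states could be recognized words (with binned
# confidence scores) and actions could be dialogue acts.
agent = TabularQAgent(actions=["greet", "confirm", "request", "close"])
act = agent.policy(state=("hello", "high_conf"))
agent.update(("hello", "high_conf"), act, reward=1.0,
             next_state=("booking", "mid_conf"))
```

In this sketch the transition function T and reward function R are not modeled explicitly; the agent learns model-free from sampled transitions, which is one common way RL is applied when the environment dynamics are unknown.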