“…However, if Q is used, then during the training of the RL controller it will often be necessary to determine the optimal action $a = \arg\max_{a} Q(s, a)$ (5-2-1) for some state s. Because of the nonlinear nature of neural networks, this maximization is very hard to carry out. It is possible to find the maximum (possibly using interval analysis; see de Weerdt, Chu, & Mulder, 2009), but this will be computationally very expensive, making the algorithm practically infeasible. Thus, V will be used as the value function.…”
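
To illustrate why the maximization over actions in (5-2-1) is awkward, here is a minimal sketch. It is not the author's method: `q_value` is a hypothetical scalar function standing in for a trained network, and the action range, grid size, and step size are assumptions chosen for illustration. The sketch compares a brute-force search over a grid of candidate actions with a local gradient ascent, which can stop at a local maximum when Q is nonconvex in a.

```python
import math


def q_value(s, a):
    # Hypothetical stand-in for a trained network Q(s, a); nonconvex in a.
    return math.sin(3.0 * a) + 0.5 * math.cos(5.0 * a + s)


def greedy_action_grid(s, a_min=-2.0, a_max=2.0, n=2001):
    # Brute-force arg max over a uniform grid of candidate actions.
    # Cost is n evaluations of Q per query, and grows quickly with
    # the dimension of the action space.
    step = (a_max - a_min) / (n - 1)
    candidates = [a_min + i * step for i in range(n)]
    return max(candidates, key=lambda a: q_value(s, a))


def greedy_action_local(s, a0, lr=0.01, iters=2000, eps=1e-5,
                        a_min=-2.0, a_max=2.0):
    # Gradient ascent using a central finite-difference gradient.
    # Cheap per step, but only guaranteed to approach a local maximum.
    a = a0
    for _ in range(iters):
        grad = (q_value(s, a + eps) - q_value(s, a - eps)) / (2.0 * eps)
        a = min(a_max, max(a_min, a + lr * grad))
    return a


if __name__ == "__main__":
    s = 0.3
    a_grid = greedy_action_grid(s)
    a_local = greedy_action_local(s, a0=-1.5)
    print("grid search :", a_grid, q_value(s, a_grid))
    print("local ascent:", a_local, q_value(s, a_local))
```

In this toy setting the grid search is affordable because the action is one-dimensional. With a real network and a multi-dimensional action, every such query must be repeated many times during training, which is the cost the passage refers to.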