“…Finally, the last few years have seen the emergence of machine-learning-based approaches to resource allocation and interference mitigation in D2D-enabled networks (e.g., Reference [14] and the references therein). However, the literature [15, 16, 17, 18, 19, 20, 21, 22] does not offer lightweight, time-critical online mechanisms for adapting resource allocation and enhancing ESE. Furthermore, such works typically use centralized approaches that do not exploit the OSA resources available in this type of scenario; when the approach is partly distributed, as in Reference [23], Q-learning is used only to improve system throughput.…”