“…Our approach belongs to a wider class of models that use information constraints for regularization to deal more efficiently with learning and decision-making problems [11], [31], [28], [25], [19], [35], [26], [18], [15], [3], [21], [38], [20], [17], [16]. One such prominent approach is Trust Region Policy Optimization (TRPO) [39].…”