“…RL is well suited to learning through trial and error: an agent maps situations to actions, discovering which actions yield the most reward (exploration) while executing actions known to maximize a numerical reward signal (exploitation) (Sutton & Barto, 2018). Within the context of NorMAS, RL is a method to invoke convention emergence or norm emergence (Frantz et al., 2014, 2015; Hosseini & Ulieru, 2012; Mashayekhi et al., 2022; Neufeld et al., 2021; Pujol et al., 2005; Riveret et al., 2014a, 2014b; Sen & Airiau, 2007; Shoham & Tennenholtz, 1992, 1997; Sugawara, 2014; Yu et al., 2013, 2014, 2015, 2017). The current de-facto standard algorithm used in past studies to induce norm emergence with RL is Q-learning (QL) (Sutton & Barto, 2018; Watkins & Dayan, 1992), a model-free RL algorithm, applied in NorMAS through social learning (learning from interactions with other agents).…”
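To make the QL-based social-learning setup concrete, the following is a minimal, illustrative Python sketch of convention emergence: agents are repeatedly paired at random to play a coordination game and update their Q-values from the rewards of those interactions, using a stateless simplification of the standard Q-learning update Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)]. All names and parameters here (ACTIONS, ALPHA, EPSILON, the payoff values, population size) are assumptions for illustration and are not drawn from the cited studies.

```python
import random

# Illustrative sketch only: Q-learning with social learning for convention
# emergence. Parameters and payoffs are assumed, not taken from the literature.

ACTIONS = ["A", "B"]   # two candidate conventions (e.g., drive left vs. right)
ALPHA = 0.1            # learning rate
EPSILON = 0.1          # exploration probability (epsilon-greedy)

class Agent:
    def __init__(self):
        # Stateless Q-values: one estimate per action.
        self.q = {a: 0.0 for a in ACTIONS}

    def act(self):
        # Explore with probability EPSILON; otherwise exploit the best action.
        if random.random() < EPSILON:
            return random.choice(ACTIONS)
        return max(self.q, key=self.q.get)

    def learn(self, action, reward):
        # One-step Q-learning update; with no successor state, the target
        # reduces to the immediate reward.
        self.q[action] += ALPHA * (reward - self.q[action])

def play_round(agents):
    # Social learning: pair agents at random; matching actions are rewarded.
    random.shuffle(agents)
    for a1, a2 in zip(agents[0::2], agents[1::2]):
        act1, act2 = a1.act(), a2.act()
        reward = 1.0 if act1 == act2 else -1.0
        a1.learn(act1, reward)
        a2.learn(act2, reward)

agents = [Agent() for _ in range(100)]
for _ in range(2000):
    play_round(agents)

# A convention has emerged if (nearly) all agents prefer the same action.
prefs = [max(ag.q, key=ag.q.get) for ag in agents]
print({a: prefs.count(a) for a in ACTIONS})
```

With epsilon-greedy action selection balancing exploration and exploitation, such a population typically converges so that nearly all agents prefer the same action; that shared preference is the emergent convention, and a reported outcome like {'A': 100, 'B': 0} would indicate full convergence under these assumed settings.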