“…For every configuration, we trained the system with 5 different random seeds and evaluated each trained system for 10 episodes (playthroughs), for 50 evaluations in total. Following existing work (Junyent, Jonsson, and Gómez 2019; Junyent, Gómez, and Jonsson 2021; Dittadi, Drachmann, and Bolander 2021), we use a discount factor of γ = 0.99 in line 28, rather than the 0.995 used by Bandres, Bonet, and Geffner (2018). Note, however, that the reported final scores are undiscounted sums of rewards.…”
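
The distinction between the discounted return used during planning and the undiscounted score that is reported can be illustrated with a minimal sketch. The function names, the constants, and the reward sequence below are hypothetical and chosen only for illustration; they are not taken from the authors' implementation or results.

```python
# Illustrative sketch only: names and the reward sequence are assumptions,
# not part of the original system.

NUM_SEEDS = 5            # independent training runs per configuration
EPISODES_PER_SEED = 10   # evaluation playthroughs per trained system
GAMMA = 0.99             # discount factor used during planning


def discounted_return(rewards, gamma=GAMMA):
    """Return sum_t gamma^t * r_t, accumulated from the last reward backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def undiscounted_score(rewards):
    """Return the plain sum of rewards, as used for the reported final scores."""
    return sum(rewards)


if __name__ == "__main__":
    # Hypothetical episode: a small reward at step 0, a larger one at step 3.
    rewards = [1.0, 0.0, 0.0, 10.0]
    print("evaluations per configuration:", NUM_SEEDS * EPISODES_PER_SEED)
    print("discounted return:", discounted_return(rewards))
    print("undiscounted score:", undiscounted_score(rewards))
```

In this sketch the later reward is weighted by γ³ in the discounted return, while the undiscounted score counts every reward at full value, so the two quantities can differ noticeably for long episodes.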