Mastering the game of Stratego with model-free multiagent reinforcement learning

Pérolat, Julien; Vylder, Bart De; Hennes, Daniel; Tarassov, Eugene; Strub, Florian; Boer, Vincent C.J. de; Müller, Paul; Connor, Jerome T.; Burch, Neil; Anthony, Thomas; McAleer, Stephen; Élie, Romuald; Cen, Sarah H.; Wang, Zhe; Gruslys, Audrūnas; Malysheva, Aleksandra; Khan, Mina; Ozair, Sherjil; Timbers, Finbarr; Pohlen, Toby; Eccles, Tom; Rowland, Mark; Lanctot, Marc; Lespiau, Jean-Baptiste; Piot, Bilal; Omidshafiei, Shayegan; Lockhart, Edward; Sifre, Laurent; Beauguerlange, Nathalie; Munos, Rémi; Silver, David; Singh, Satinder; Hassabis, Demis; Tuyls, Karl

doi:10.1126/science.add4679

Cited by 76 publications

(40 citation statements)

References 27 publications


“…However, when applied to CMDPs, minimizing the squared gradient with respect to the Lagrange multiplier(s) is equivalent to an apprenticeship learning problem (Abbeel & Ng, 2004; Zahavy et al, 2020a; Shani et al, 2022), which is itself a convex MDP representing a challenging optimization problem (Zahavy et al, 2021b). Perolat et al (2021) instead augment the objective with an adaptive regularizer, solving the resulting convex/concave (but biased) problem exactly before iteratively refitting with progressively lesser regularization.…”

Section: A Additional Related Work (mentioning)

confidence: 99%

ReLOAD: Reinforcement Learning with Optimistic Ascent-Descent for Last-Iterate Convergence in Constrained MDPs

Moskovitz, O’Donoghue, Veeriah et al. 2023

Preprint


In recent years, Reinforcement Learning (RL) has been applied to real-world problems with increasing success. Such applications often require to put constraints on the agent's behavior. Existing algorithms for constrained RL (CRL) rely on gradient descent-ascent, but this approach comes with a caveat. While these algorithms are guaranteed to converge on average, they do not guarantee last-iterate convergence, i.e., the current policy of the agent may never converge to the optimal solution. In practice, it is often observed that the policy alternates between satisfying the constraints and maximizing the reward, rarely accomplishing both objectives simultaneously. Here, we address this problem by introducing Reinforcement Learning with Optimistic Ascent-Descent (ReLOAD), a principled CRL method with guaranteed last-iterate convergence. We demonstrate its empirical effectiveness on a wide variety of CRL problems including discrete MDPs and continuous control. In the process we establish a benchmark of challenging CRL problems.



“…MiniMaxKL Objectives in Two-Player Zero-Sum Games. A number of recent prior works have made use of MiniMaxEnt and MiniMaxKL objectives for the purpose of inducing last iterate convergence (Perolat et al, 2021; Cen et al, 2021; Zeng et al, 2022; Sokota et al, 2022a; Perolat et al, 2022). While we also make use of these objectives, our use case (eliminating the noncorrespondence problem) differs substantially.…”

Section: Related Work (mentioning)

confidence: 99%

Abstracting Imperfect Information Away from Two-Player Zero-Sum Games

Sokota, D'Orazio, Ling et al. 2023

Preprint


In their seminal work, showed that imperfect information can be abstracted away from common-payoff games by having players publicly announce their policies as they play. This insight underpins sound solvers and decision-time planning algorithms for common-payoff games. Unfortunately, a naive application of the same insight to two-player zero-sum games fails because Nash equilibria of the game with public policy announcements may not correspond to Nash equilibria of the original game. As a consequence, existing sound decision-time planning algorithms require complicated additional mechanisms that have unappealing properties. The main contribution of this work is showing that certain regularized equilibria do not possess the aforementioned noncorrespondence problem; thus, computing them can be treated as perfect information problems. Because these regularized equilibria can be made arbitrarily close to Nash equilibria, our result opens the door to a new perspective on solving two-player zero-sum games and, in particular, yields a simplified framework for decision-time planning in two-player zero-sum games, void of the unappealing properties that plague existing decision-time planning approaches.


“…A real game system can involve a large number of strategies, most of which would be dominated during the process of finding Nash equilibrium [3,22]. As illustrated in Figure 1a, the full equilibrium finding process can be classified into three stages:…”

Section: Collapse (mentioning)

confidence: 99%

“…This study has real life applications in fields including artificial intelligence [22] and the social systems addressed in [9,25]. The collapse is a legitimate constituent part of the process of game evolution.…”

Section: Related Work (mentioning)

confidence: 99%