2020 · Preprint
DOI: 10.48550/arxiv.2008.12234

The Advantage Regret-Matching Actor-Critic

Abstract: Regret minimization has played a key role in online learning, equilibrium computation in games, and reinforcement learning (RL). In this paper, we describe a general model-free RL method for no-regret learning based on repeated reconsideration of past behavior. We propose a model-free RL algorithm, the Advantage Regret-Matching Actor-Critic (ARMAC): rather than saving past state-action data, ARMAC saves a buffer of past policies, replaying through them to reconstruct hindsight assessments of past behavior. The…
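To make the mechanism described in the abstract concrete, here is a minimal sketch in Python. It only mirrors the stated idea of keeping a buffer of past policies and replaying an episode through a sampled one to score its actions in hindsight; the `reset`/`step` environment callables, the `value` critic function, and the policy-as-callable interface are all hypothetical placeholders, not the actual ARMAC losses or sampling scheme from the paper.

```python
import random

# Hypothetical interfaces (not from the paper): each policy is a callable
# state -> action, `value` is a callable state -> estimated state value,
# and `reset`/`step` stand in for an environment.

def sample_past_policy(policy_buffer):
    """ARMAC-style idea: keep a buffer of past policy snapshots and sample one
    to replay, instead of storing raw state-action transitions."""
    return random.choice(policy_buffer)

def hindsight_advantages(reset, step, past_policy, value, gamma=0.99):
    """Replay one episode under a sampled past policy and score each action
    with the current value estimates, giving a hindsight assessment."""
    out = []
    state, done = reset(), False
    while not done:
        action = past_policy(state)
        state_next, reward, done = step(state, action)
        target = reward + (0.0 if done else gamma * value(state_next))
        out.append((state, action, target - value(state)))  # one-step advantage
        state = state_next
    return out
```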

Cited by 3 publications (4 citation statements) · References 14 publications

“…MARL Algorithms in Zero-Sum Games. MARL methods have been applied to zero-sum games tracing back to the TD-Gammon project (Tesauro 1995). A large body of work (Zinkevich et al. 2007; Brown et al. 2019; Steinberger, Lerer, and Brown 2020; Gruslys et al. 2020) is based on regret minimization, and a well-known result is that the average of policies produced by self-play of regret-minimizing algorithms converges to the NE policy of zero-sum games (Freund and Schapire 1996). Another notable line of work (Littman 1994; Heinrich, Lanctot, and Silver 2015; Lanctot et al. 2017; Perolat et al. 2022) combines RL algorithms with game-theoretic approaches.…”
Section: Preliminary Markov Game
confidence: 99%
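The convergence result quoted above (Freund and Schapire 1996) is easy to see on a toy example. The following is a minimal sketch, not taken from any of the cited papers: two regret-matching learners in self-play on rock-paper-scissors, where the time-averaged strategies approach the Nash equilibrium (1/3, 1/3, 1/3).

```python
import numpy as np

# Row player's payoff matrix for rock-paper-scissors (zero-sum, so the
# column player's payoff is the negative transpose).
PAYOFF = np.array([[0, -1, 1],
                   [1, 0, -1],
                   [-1, 1, 0]], dtype=float)

def regret_matching(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    positive = np.maximum(regrets, 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    return np.full_like(regrets, 1.0 / regrets.size)  # fall back to uniform

def self_play(iterations=10000):
    n = PAYOFF.shape[0]
    regrets = [np.zeros(n), np.zeros(n)]
    strategy_sums = [np.zeros(n), np.zeros(n)]
    for _ in range(iterations):
        strategies = [regret_matching(regrets[0]), regret_matching(regrets[1])]
        for p in range(2):
            strategy_sums[p] += strategies[p]
        # Expected payoff of each pure action against the opponent's mix.
        u_row = PAYOFF @ strategies[1]
        u_col = -PAYOFF.T @ strategies[0]
        regrets[0] += u_row - strategies[0] @ u_row
        regrets[1] += u_col - strategies[1] @ u_col
    # The *average* strategies, not the last iterates, approach equilibrium.
    return [s / s.sum() for s in strategy_sums]

print(self_play())  # both averages approach (1/3, 1/3, 1/3)
```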
“…Therefore, many neural variants of CFR have been proposed. They approximate the behaviour of CFR via neural networks to scale to large-scale games (Li et al. 2019; Steinberger 2019; Gruslys et al. 2020; Hennes et al. 2020; Fu et al. 2021; McAleer et al. 2022). At each iteration, these methods estimate the counterfactual regrets and update the strategy using the estimated counterfactual regrets.…”
Section: Related Work
confidence: 99%
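As a point of reference for the "counterfactual regrets" mentioned in these statements, here is a minimal sketch of the quantity being estimated. The `action_values`, `strategy`, and `opponent_reach` inputs are hypothetical placeholders: tabular CFR computes them exactly by traversing the game tree, while the cited neural variants approximate them.

```python
import numpy as np

def counterfactual_regrets(action_values, strategy, opponent_reach):
    """Instantaneous counterfactual regret at one information set: the gap
    between each action's counterfactual value and the value of the current
    strategy, weighted by the opponents' reach probability."""
    infoset_value = strategy @ action_values
    return opponent_reach * (action_values - infoset_value)

# Example: three legal actions; the current strategy under-plays action 0.
print(counterfactual_regrets(np.array([1.0, 0.2, -0.5]),
                             np.array([0.3, 0.5, 0.2]),
                             0.25))
# -> [ 0.175 -0.025 -0.2 ]
```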
“…Due to the large-scale state space in most real-world scenarios, it is impossible to traverse the entire game tree and use tables to represent strategies. To sidestep the issue, many neural variants of CFR have been proposed (Li et al. 2019; Gruslys et al. 2020; Hennes et al. 2020; Steinberger, Lerer, and Brown 2020; Fu et al. 2021; McAleer et al. 2022). At each time, they estimate the counterfactual regrets and update the strategy using the estimated regrets.…”
Section: Introduction
confidence: 99%
“…However, Deep CFR uses external sampling, which may be impractical for games with a large branching factor such as Stratego and Barrage Stratego. DREAM (Steinberger et al., 2020) and ARMAC (Gruslys et al., 2020) are model-free regret-based deep learning approaches.…”
Section: Related Work
confidence: 99%