Online Double Oracle
2021 · Preprint
DOI: 10.48550/arxiv.2103.07780

Cited by 6 publications (10 citation statements)
References 0 publications
“…These algorithms involve combining a single-agent RL algorithm, such as deep Q-network (DQN) [35] or proximal policy optimisation (PPO) [38], with a best-response-based Nash equilibrium finding algorithm for normal-form games, such as fictitious play [10], double oracle [34], and many others. A few other examples, such as neural fictitious self-play (NFSP) [21], policy space response oracles (PSRO) [28], online double oracle [13], and prioritized fictitious self-play [45], in general also fall into this class of algorithms or their variants. These algorithms call the single-agent RL algorithm to compute a best response to the current "meta-strategy", and then use the best-response-based algorithm for normal-form games to compute a new "meta-strategy".…”
Section: Discussion
confidence: 99%
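The meta-strategy loop described in this excerpt can be made concrete with a small sketch. The following is a minimal, hypothetical illustration of the double oracle scheme on a zero-sum matrix game: the restricted ("meta") game is solved with fictitious play, and the exact best responses computed here stand in for the single-agent RL oracle (DQN/PPO) used in the deep variants such as PSRO or NFSP. All function names and the random example game are illustrative assumptions, not code from the cited works.

```python
# Minimal sketch (assumptions only): double oracle on a zero-sum matrix game.
# In deep variants (PSRO, NFSP, online double oracle) the exact best responses
# below would be replaced by approximate ones trained with DQN/PPO.
import numpy as np

def meta_strategy(sub_payoff, iters=2000):
    """Approximate Nash strategies of the restricted game via fictitious play,
    one of the meta-solvers named in the excerpt."""
    n_rows, n_cols = sub_payoff.shape
    row_counts, col_counts = np.zeros(n_rows), np.zeros(n_cols)
    row_counts[0] = col_counts[0] = 1.0
    for _ in range(iters):
        # each player best-responds to the opponent's empirical mixture
        row_counts[np.argmax(sub_payoff @ (col_counts / col_counts.sum()))] += 1
        col_counts[np.argmin((row_counts / row_counts.sum()) @ sub_payoff)] += 1
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

def double_oracle(payoff, n_iters=20):
    """Grow the populations by adding best responses to the current meta-strategy."""
    rows, cols = [0], [0]                        # initial populations (action indices)
    for _ in range(n_iters):
        sub = payoff[np.ix_(rows, cols)]         # restricted ("meta") game
        p, q = meta_strategy(sub)                # meta-strategies over the populations
        # lift the meta-strategies back to the full action space
        q_full = np.zeros(payoff.shape[1]); q_full[cols] = q
        p_full = np.zeros(payoff.shape[0]); p_full[rows] = p
        br_row = int(np.argmax(payoff @ q_full))   # row best response (RL oracle in deep variants)
        br_col = int(np.argmin(p_full @ payoff))   # column best response
        if br_row in rows and br_col in cols:
            break                                  # no new best responses: converged
        if br_row not in rows: rows.append(br_row)
        if br_col not in cols: cols.append(br_col)
    return rows, cols

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.uniform(-1, 1, size=(30, 30))          # random zero-sum game (row maximizes A)
    print(double_oracle(A))
```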
“…This can be achieved through various means, such as monotonic improvement in exploitability [55] or a regret bound [64]. Another approach to improving efficiency in PB-DRL is to use distributed computing techniques [43, 65]. These techniques can enable faster evaluation of individuals within a population, as well as better parallelization of the learning process.…”
Section: Challenges
confidence: 99%
“…bilinear game is usually denoted by the pair of payoff matrices (A, B); the game is zero-sum when B = −A, and, as an important notion, the rank of a game (A, B) is defined as the rank of the matrix A + B. Several interesting games can be viewed as special cases of bilinear games, such as bimatrix games [47]-[49], where…”
Section: A. Zero-Sum Games (ZSGs)
confidence: 99%
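The definitions quoted above (zero-sum when B = −A, game rank defined as rank(A + B)) are easy to check numerically. A tiny sketch, with made-up example matrices:

```python
# Illustration of the quoted definitions: a bimatrix game is a pair (A, B),
# it is zero-sum when B = -A, and its rank is rank(A + B).
# The example matrices are invented for illustration only.
import numpy as np

A = np.array([[3.0, 0.0],
              [5.0, 1.0]])
B = -A + np.outer([1.0, 2.0], [1.0, 1.0])   # a rank-1 perturbation of -A

def game_rank(A, B):
    """Rank of the bimatrix game (A, B), i.e. rank(A + B)."""
    return np.linalg.matrix_rank(A + B)

print(game_rank(A, -A))   # 0: zero-sum game
print(game_rank(A, B))    # 1: rank-1 game
```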
“…For normal-form games, a large number of algorithms have been proposed, e.g., regret matching (RM, first proposed by Hart and Mas-Colell in 2000 [220]), RM+ [221], fictitious play [222], [223], double oracle [224], and online double oracle [49], among others. Among these, the most prevalent algorithms are based on regret learning and are usually called no-regret (or sublinear-regret) learning algorithms, relying in general on external and internal regrets, as defined below.…”
Section: A. Zero-Sum Normal- and Extensive-Form Games
confidence: 99%
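As a concrete companion to this excerpt, below is a minimal sketch of regret matching [220] in self-play on a zero-sum normal-form game, where the time-averaged strategies approach a Nash equilibrium. The example game and all names are illustrative assumptions, not taken from the cited paper.

```python
# Minimal sketch of regret matching (Hart & Mas-Colell) in self-play on a
# zero-sum matrix game; the time-averaged strategies approximate a Nash
# equilibrium. Example game and names are illustrative assumptions.
import numpy as np

def regret_matching(payoff, n_iters=10000):
    n, m = payoff.shape
    r_regret, c_regret = np.zeros(n), np.zeros(m)     # cumulative regrets
    r_sum, c_sum = np.zeros(n), np.zeros(m)           # cumulative strategies

    def mix(regret):
        # play in proportion to positive regret, uniform if all regrets <= 0
        pos = np.maximum(regret, 0.0)
        return pos / pos.sum() if pos.sum() > 0 else np.full(len(regret), 1.0 / len(regret))

    for _ in range(n_iters):
        p, q = mix(r_regret), mix(c_regret)
        r_sum += p
        c_sum += q
        # instantaneous regrets against the opponent's current mixed strategy
        row_payoffs = payoff @ q          # row player's payoff for each pure action
        col_payoffs = -(p @ payoff)       # column player's payoff (zero-sum)
        r_regret += row_payoffs - p @ row_payoffs
        c_regret += col_payoffs - q @ col_payoffs
    return r_sum / n_iters, c_sum / n_iters

if __name__ == "__main__":
    rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)  # rock-paper-scissors
    p, q = regret_matching(rps)
    print(np.round(p, 3), np.round(q, 3))   # both close to [1/3, 1/3, 1/3]
```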