Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence 2021
DOI: 10.24963/ijcai.2021/466
Model-based Multi-agent Policy Optimization with Adaptive Opponent-wise Rollouts

Abstract: This paper investigates model-based methods in multi-agent reinforcement learning (MARL). We specify the dynamics sample complexity and the opponent sample complexity in MARL, and conduct a theoretical analysis of the return discrepancy upper bound. To reduce this upper bound, and thereby keep sample complexity low throughout learning, we propose a novel decentralized model-based MARL method, named Adaptive Opponent-wise Rollout Policy Optimization (AORPO). In AORPO, each agent builds its multi-…
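
To make the setup concrete, here is a minimal Python sketch (an illustration only, not the paper's implementation) of the kind of multi-agent environment model described in the citation statements below: a learned state dynamics model plus one learned model per opponent, which a single agent uses for Dyna-style rollouts. The class and function names (DynamicsModel, OpponentModel, rollout) are assumed for this sketch.

# Illustrative sketch (not AORPO's code): a decentralized agent's multi-agent
# environment model = one dynamics model + one model per opponent.
import numpy as np

class DynamicsModel:
    """Predicts the next state from the current state and the joint action."""
    def predict(self, state, joint_action):
        return state + 0.0 * np.sum(joint_action)  # placeholder for a learned prediction

class OpponentModel:
    """Predicts one opponent's action from the current state."""
    def predict(self, state):
        return np.zeros(2)  # placeholder action

def rollout(dynamics, opponent_models, policy, state, horizon):
    """Dyna-style rollout: the agent acts with its own policy, opponents are
    simulated by their learned models, and transitions come from the dynamics model."""
    trajectory = []
    for _ in range(horizon):
        own_action = policy(state)
        opp_actions = [m.predict(state) for m in opponent_models]
        joint_action = np.concatenate([own_action] + opp_actions)
        next_state = dynamics.predict(state, joint_action)
        trajectory.append((state, joint_action, next_state))
        state = next_state
    return trajectory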

Cited by 15 publications (4 citation statements)
References 3 publications
“…In (Kamra et al. 2020), interaction graph-based trajectory prediction methods are suggested. In (Zhang et al. 2021), a decentralized MBRL method is proposed that considers multiple opponent models. Unfortunately, these works either consider zero-sum games or treat the players as atomic and learn separate models for the players, which can be computationally expensive.…”
Section: Related Work
confidence: 99%
“…From the analysis of [Zhang et al., 2021b], the sample efficiency of MARL can be decomposed into two parts, i.e., the dynamics sample complexity, which measures the amount of interaction with the real environment, and the opponent sample complexity, which measures the amount of interaction between the ego agent i and the other agents {−i}. In this regard, it is natural to derive the value discrepancy of the agent's policy between the multi-agent environment model (i.e., the state dynamics model together with the opponent models) and the real environment, with respect to the error terms of the state dynamics model and the opponent models, when training the policy via Dyna-style model rollouts.…”
Section: Multi-agent RL
confidence: 99%
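
For orientation only, a value-discrepancy bound of the kind described above typically carries one error term for the dynamics model and one per opponent model. A schematic form (illustrative constants and notation, not the paper's exact statement) is:

\[
\bigl|\,\eta^{i}[\pi] - \hat{\eta}^{i}[\pi]\,\bigr| \;\le\; C_{\mathrm{dyn}}(\gamma)\,\epsilon_{m} \;+\; \sum_{j \neq i} C_{\mathrm{opp}}(\gamma)\,\epsilon_{\pi^{j}},
\]

where \(\eta^{i}\) and \(\hat{\eta}^{i}\) are agent \(i\)'s returns in the real and modeled environments, \(\epsilon_{m}\) is the dynamics-model error, and \(\epsilon_{\pi^{j}}\) is the modeling error for opponent \(j\).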
“…The bound shows that opponent models with higher modeling error contribute more to the discrepancy bound, which motivates the design of the algorithm called Adaptive Opponent-wise Rollout Policy Optimization (AORPO) [Zhang et al., 2021b]. Specifically, the rollout scheme of AORPO allows opponent models with lower generalization error to sample longer trajectories, while shorter sampled trajectories can be supplemented through a communication protocol with the real opponents.…”
Section: Multi-agent RL
confidence: 99%
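
As a rough illustration of the adaptive rollout scheme described above, the Python sketch below assigns each opponent model a rollout length that shrinks as its estimated generalization error grows, so better-modeled opponents are rolled out longer inside the learned model. The schedule, names, and constants (adaptive_rollout_lengths, max_len, min_len) are assumptions for this sketch, not AORPO's actual rule.

# Illustrative sketch (assumed schedule, not AORPO's exact rule): opponent models
# with lower estimated generalization error get longer model rollouts.
def adaptive_rollout_lengths(opponent_errors, max_len=10, min_len=1):
    """Map each opponent model's error estimate to a per-opponent rollout length."""
    lengths = {}
    worst = max(opponent_errors.values())
    for opp, err in opponent_errors.items():
        # Scale the length down as the error grows; clamp to [min_len, max_len].
        frac = 1.0 - err / (worst + 1e-8)
        lengths[opp] = max(min_len, int(round(min_len + frac * (max_len - min_len))))
    return lengths

# Example: opponent "B" is modeled less accurately, so its model rollout is kept
# short; the remaining steps would be gathered from the real opponent instead
# (the communication protocol mentioned in the quote above).
print(adaptive_rollout_lengths({"A": 0.05, "B": 0.30}))  # longer length for "A", minimal for "B"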
“…In an adaptive system [29], an agent is constantly learning and making dynamic adaptive adjustments. An opponent model should be constructed to enable adaptive changes based on the opponent's real-time strategy.…”
Section: Introduction
confidence: 99%