Learning Explicit Credit Assignment for Cooperative Multi-Agent Reinforcement Learning via Polarization Policy Gradient
Preprint, 2022
DOI: 10.48550/arxiv.2210.05367

Cited by 1 publication (3 citation statements: 0 supporting, 3 mentioning, 0 contrasting)
References 0 publications
“…Multi-Agent Policy Gradient. The policy gradient in stochastic MAPG methods has the form (Foerster et al 2018; Wang et al 2020c; Lou et al 2023b; Chen et al 2022), where objective G_i varies across different methods, such as counterfactual advantage (Foerster et al 2018) and polarized joint-action value (Chen et al 2022). The objective in DOP is individual aristocratic utility (Wolpert and Tumer 2001), which ignores other agents' utilities to avoid the CDM issue, but the cooperation is also limited by this objective.…”
Section: Related Work (mentioning)
confidence: 99%
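The excerpt above says the stochastic MAPG gradient "has the form" of a shared template with a method-specific objective G_i, but the equation itself is not reproduced on this page. As a minimal sketch, assuming the standard stochastic multi-agent policy gradient used by COMA/DOP-style methods (not quoted from the cited works):

% Assumed standard stochastic MAPG gradient (sketch only; the equation is not
% reproduced in the excerpt above). Each agent i has a decentralized policy
% \pi_{\theta_i}(a_i | \tau_i) over its action-observation history \tau_i, and
% G_i is the method-specific objective (e.g., counterfactual advantage in COMA,
% polarized joint-action value in Chen et al 2022).
\nabla_{\theta} J(\theta)
  = \mathbb{E}_{\pi}\!\left[ \sum_{i=1}^{n} \nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid \tau_i)\, G_i \right]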
“…The objective in DOP is individual aristocratic utility (Wolpert and Tumer 2001), which ignores other agents' utilities to avoid the CDM issue, but the cooperation is also limited by this objective. It is worth noting that polarized joint-action value (Chen et al 2022) […] (Zhang et al 2021; Peng et al 2021; Zhou, Lan, and Aggarwal 2022) adopt value factorization to mix individual Q values to get Q^π_tot. As the global Q value is determined by the centralized critic for all agents, sub-optimal actions of one agent will easily influence all others.…”
Section: Related Work (mentioning)
confidence: 99%
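The statement above mentions methods that adopt value factorization to mix individual Q values into Q^π_tot. A minimal sketch of such a factorization, assuming the common VDN/QMIX-style mixing rather than the exact formulation of the works cited there:

% Assumed VDN/QMIX-style value factorization (illustrative sketch, not the
% exact formulation of the cited works). Per-agent utilities Q_i are combined
% by a mixing function f_mix, conditioned on the global state s, to form the
% joint action value used by the centralized critic.
Q^{\pi}_{tot}(\boldsymbol{\tau}, \mathbf{a}, s)
  = f_{\mathrm{mix}}\bigl(Q_1(\tau_1, a_1), \ldots, Q_n(\tau_n, a_n); s\bigr),
\qquad
\frac{\partial Q^{\pi}_{tot}}{\partial Q_i} \ge 0 \quad \text{(monotonicity constraint, as in QMIX)}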