2017 · Preprint
DOI: 10.48550/arxiv.1705.08926
Counterfactual Multi-Agent Policy Gradients

Abstract: Many real-world problems, such as network packet routing and the coordination of autonomous vehicles, are naturally modelled as cooperative multi-agent systems. There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agent…
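A minimal sketch of the centralised-critic / decentralised-actor layout the abstract describes. All class names, network sizes, and layer choices here are illustrative assumptions, not the paper's actual architecture.

```python
# Sketch only: centralised critic + decentralised actors (sizes are assumptions).
import torch
import torch.nn as nn


class DecentralisedActor(nn.Module):
    """One actor per agent; conditions only on that agent's local observation."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Returns a distribution over this agent's actions from local information only.
        return torch.softmax(self.net(obs), dim=-1)


class CentralisedCritic(nn.Module):
    """Single critic; conditions on the global state and the joint action."""

    def __init__(self, state_dim: int, n_agents: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_agents * n_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, joint_action_onehot: torch.Tensor) -> torch.Tensor:
        # Estimates a Q-value for the whole team given state and joint action.
        return self.net(torch.cat([state, joint_action_onehot], dim=-1))
```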

Cited by 47 publications (91 citation statements) · References 25 publications
“…(iii) Counterfactual Multi-Agent Policy Gradients (COMA) is a CLDE-based method that uses a single centralised critic for all the agents to estimate the Q-function and decentralised actors to optimise the agents' policies. 42 It addresses the credit assignment problem in multi-agent systems, which was earlier studied using difference rewards, an approach that shapes the global reward so that agents are rewarded or penalised based on their contributions to the system's performance. These rewards are often found by estimating a reward function or through simulation, approaches that are not always trivial.…”
Section: In-depth Exploration of MARL Algorithms (mentioning)
confidence: 99%
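The credit assignment the excerpt above refers to is handled in COMA by a counterfactual baseline: each agent's advantage compares the Q-value of the joint action actually taken with an expectation over that agent's alternative actions, holding the other agents' actions fixed. The sketch below illustrates this for a single agent; the function and argument names are hypothetical, and it assumes the centralised critic already exposes Q-values for every alternative action of that agent.

```python
# Sketch of a COMA-style counterfactual advantage for one agent.
# Assumes q_values_for_agent[u'] = Q(s, (u^{-a}, u')) with the other agents'
# actions fixed; names are illustrative, not from the paper's code.
import torch


def counterfactual_advantage(
    q_values_for_agent: torch.Tensor,  # shape (n_actions,)
    policy_probs: torch.Tensor,        # shape (n_actions,): pi^a(. | tau^a)
    taken_action: int,
) -> torch.Tensor:
    # Counterfactual baseline: marginalise out this agent's own action under its policy.
    baseline = torch.dot(policy_probs, q_values_for_agent)
    # Advantage of the action actually taken relative to that baseline.
    return q_values_for_agent[taken_action] - baseline
```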
“…A branch of methods investigates factorizable Q-functions (Sunehag et al., 2017; Rashid et al., 2018; Mahajan et al., 2019; Son et al., 2019), where the team Q is decomposed into individual utility functions. Other methods adopt the actor-critic approach, where only the critic is centralized (Foerster et al., 2017; Lowe et al., 2017). However, most CTDE methods structurally require fixed-size teams and are often applied to homogeneous teams.…”
Section: Centralized Training with Decentralized Execution (mentioning)
confidence: 99%
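For the factorisation family mentioned in this excerpt, the simplest instance is a VDN-style additive decomposition, where the team value is the sum of per-agent utilities. A minimal sketch under that assumption (shapes and names are illustrative):

```python
# VDN-style additive value decomposition: Q_team = sum_a Q_a(tau_a, u_a).
import torch


def team_q_vdn(per_agent_utilities: torch.Tensor) -> torch.Tensor:
    # per_agent_utilities: shape (batch, n_agents); entry [b, a] is Q_a(tau_a, u_a).
    # Summing preserves per-agent argmaxes, so each agent can act greedily on its
    # own utility at execution time while training uses the team value.
    return per_agent_utilities.sum(dim=-1)
```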
“…In contrast, a total optimization objective value cannot be directly applied as the joint reward $r^{p,\mathrm{jo}}_{m,t}$ of each PA-agent, for two reasons: 1) a global reward makes it difficult for each agent to deduce its individual contribution, since the gradient computed for each actor does not explicitly reason about how that agent's actions contribute to the global reward [35]; and 2) different users (i.e., different agents) can have different weights to account for different priorities. The final reward of PA-agent $m$ is $r^{p}_{m,t} = r^{p,\mathrm{int}}_{m,t} + r^{p,\mathrm{jo}}_{m,t}$.…”
Section: Reward of the SA Subtask (mentioning)
confidence: 99%
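A toy illustration of the per-agent reward combination in the excerpt above: each PA-agent receives its individual term plus a share of the joint term. The per-agent weighting of the joint term is an assumption added here to reflect the excerpt's remark about different priorities; it is not the cited paper's exact formulation.

```python
# Toy sketch: r^p_{m,t} = r^{p,int}_{m,t} + w_m * r^{p,jo}_{m,t}
# (the weights w_m are an illustrative assumption, not from the cited paper).
from typing import Sequence


def combined_rewards(
    individual: Sequence[float],  # r^{p,int}_{m,t} for each agent m
    joint: float,                 # shared joint-objective term
    weights: Sequence[float],     # assumed per-agent priority weights
) -> list[float]:
    return [r_int + w * joint for r_int, w in zip(individual, weights)]
```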