2018
DOI: 10.1609/aaai.v32i1.11794

Counterfactual Multi-Agent Policy Gradients

Abstract: Many real-world problems, such as network packet routing and the coordination of autonomous vehicles, are naturally modelled as cooperative multi-agent systems. There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agent…
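
The sketch below illustrates the core idea the abstract describes: a centralised critic supplies each decentralised actor with a counterfactual advantage, i.e. the Q-value of the action actually taken minus a baseline that marginalises out that agent's own action under its current policy. The function name, array shapes, and discrete-action assumption are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def coma_advantage(q_values, policy_probs, taken_action):
    """Counterfactual advantage for a single agent (illustrative sketch).

    q_values:     shape (n_actions,), the centralised critic's Q(s, (u^-a, u'))
                  for every alternative action u' of this agent, with the other
                  agents' actions held fixed.
    policy_probs: shape (n_actions,), this agent's policy pi^a(u' | tau^a).
    taken_action: index of the action the agent actually executed.
    """
    # Counterfactual baseline: marginalise out this agent's own action under
    # its current policy while keeping the other agents' actions fixed.
    baseline = np.dot(policy_probs, q_values)
    # Advantage of the executed action relative to that baseline.
    return q_values[taken_action] - baseline
```

Because the baseline does not depend on the action the agent actually took, subtracting it leaves the expected policy gradient unchanged while reducing its variance; this is the role the centralised critic plays for each decentralised actor.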

Cited by 952 publications (373 citation statements)
References 34 publications
“…When a high-fidelity simulator is available, this may not be problematic and in this case one may also consider improving performance by applying centralised training with decentralised execution in the online phase (e.g. [74,75]). Alternatively, for rapid adaptation, (theoretical or empirical) demonstrations of sample complexity are required rather than the asymptotic convergence guarantees for model-free MARL (e.g.…”
Section: Multi-agent Reinforcement Learning
confidence: 99%
“…The goal of MARL is to derive decentralized policies for agents and impose a consensus to conduct a collaborative task. To achieve this, the multi-agent deep deterministic policy gradient (MADDPG) [22] and counterfactual multi-agent (COMA) [23] construct a centralized critic to train decentralized actors by augmenting it with extra information about other agents, such as observations and actions. Compared with independent learning [24], which only uses local information, MADDPG and COMA can derive better policies in a non-stationary environment.…”
Section: Related Work
confidence: 99%
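
The centralised critic described in the quote above, one that is "augmented with extra information about other agents, such as observations and actions", might look roughly like the following. Layer sizes, the one-hot action encoding, and the PyTorch framing are assumptions for illustration, not the MADDPG or COMA reference architecture.

```python
import torch
import torch.nn as nn

class CentralisedCritic(nn.Module):
    """Rough sketch of a critic that, during training, conditions on every
    agent's observation and action (the extra information the quote refers
    to). Layer widths and input encodings are illustrative assumptions."""

    def __init__(self, n_agents, obs_dim, act_dim, hidden=128):
        super().__init__()
        joint_dim = n_agents * (obs_dim + act_dim)
        self.net = nn.Sequential(
            nn.Linear(joint_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar value for the joint state-action
        )

    def forward(self, all_obs, all_actions):
        # all_obs:     (batch, n_agents, obs_dim)
        # all_actions: (batch, n_agents, act_dim), e.g. one-hot joint actions
        joint = torch.cat([all_obs.flatten(1), all_actions.flatten(1)], dim=-1)
        return self.net(joint)
```

At execution time each actor acts from its local observation only; a critic like the one above is used only during training, which is what distinguishes centralised training with decentralised execution from independent learning.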
“…After training, θ^π_l and θ^Q_l are updated as [40]. Our method can be extended to continuous action space by estimating the expectation of b_i with Monte Carlo samples or a learnable state value function V(o_i, m_i) [23].…”
Section: Implementation In An Actor-critic Framework
confidence: 99%
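
A Monte Carlo estimate of the baseline expectation mentioned in that quote might look roughly like the sketch below. The names `critic`, `fixed_inputs`, and `policy` are placeholders assumed for illustration, and `policy` is taken to be a torch.distributions object so that `.sample()` draws actions from the agent's current policy.

```python
import torch

def monte_carlo_baseline(critic, fixed_inputs, policy, n_samples=16):
    """Estimate a baseline b_i by Monte Carlo when the agent's action space
    is continuous and exact marginalisation over its own actions is
    intractable (illustrative sketch with placeholder arguments)."""
    samples = [critic(fixed_inputs, policy.sample()) for _ in range(n_samples)]
    # Averaging the critic over sampled own-actions approximates
    # E_{a ~ pi_i}[ Q(fixed_inputs, a) ].
    return torch.stack(samples).mean(dim=0)
```

The alternative the quote mentions, a learnable state value function V(o_i, m_i), would replace the sampling loop with a single forward pass of a trained value network.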
“…A key problem while learning from global rewards in multiagent setting is that the gradient computed for an agent i does not explicitly reason about the contribution of that agent to the global team reward. As a result, the gradient becomes noisy given that other agents are also exploring, leading to poor quality solutions (Foerster et al. 2017; Bagnell and Ng 2005). Fortunately, creating a separation among local MDPs of agents and joint event-based rewards automatically addresses this problem of noisy gradient in TIDec-MDPs.…”
Section: Multiagent Credit Assignment
confidence: 99%
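
One classic way to expose an individual agent's contribution to a shared team reward, closely related to the credit-assignment problem described in that quote, is a difference-rewards style counterfactual. The toy below is a hedged illustration with an assumed default action; it is not the TIDec-MDP construction the quoted paper uses, nor COMA's learned baseline.

```python
def difference_reward(team_reward_fn, joint_action, agent_idx, default_action=0):
    """Difference-rewards style counterfactual (illustrative sketch): compare
    the team reward actually obtained with the reward that would have been
    obtained had this agent taken a fixed default action instead."""
    counterfactual = list(joint_action)
    counterfactual[agent_idx] = default_action
    return team_reward_fn(joint_action) - team_reward_fn(counterfactual)

# Toy team reward in which only agent 0's action matters.
def team_reward(joint):
    return 10.0 * joint[0]

print(difference_reward(team_reward, [1, 1], agent_idx=0))  # 10.0: agent 0 contributed
print(difference_reward(team_reward, [1, 1], agent_idx=1))  # 0.0: agent 1 did not
```

Replacing the raw team reward with such a per-agent counterfactual removes the noise contributed by other agents' exploration from each agent's gradient signal, which is exactly the problem the quoted passage identifies.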