2017 · Preprint
DOI: 10.48550/arxiv.1705.08926
Counterfactual Multi-Agent Policy Gradients

Abstract: Many real-world problems, such as network packet routing and the coordination of autonomous vehicles, are naturally modelled as cooperative multi-agent systems. There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agent…
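A minimal sketch of the centralised-critic / decentralised-actor layout the abstract describes. All class names, network sizes, and layer choices here are illustrative assumptions, not the paper's actual architecture.

```python
# Sketch only: centralised critic + decentralised actors (sizes are assumptions).
import torch
import torch.nn as nn


class DecentralisedActor(nn.Module):
    """One actor per agent; conditions only on that agent's local observation."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Returns a distribution over this agent's actions from local information only.
        return torch.softmax(self.net(obs), dim=-1)


class CentralisedCritic(nn.Module):
    """Single critic; conditions on the global state and the joint action."""

    def __init__(self, state_dim: int, n_agents: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_agents * n_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, joint_action_onehot: torch.Tensor) -> torch.Tensor:
        # Estimates a Q-value for the whole team given state and joint action.
        return self.net(torch.cat([state, joint_action_onehot], dim=-1))
```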

Cited by 47 publications (91 citation statements) · References 25 publications
“…(iii) Counterfactual Multi-Agent Policy Gradients (COMA) is a CLDE-based method that uses a single centralised critic for all the agents to estimate the Q-function and decentralised actors to optimise the agents' policies. 42 It addresses the credit assignment problem in multi-agent systems, which was earlier studied using difference rewards, an approach that shapes the global reward so that agents are rewarded or penalised based on their contributions to the system's performance. These rewards are often found by estimating a reward function or through simulation, approaches that are not always trivial.…”
Section: In-depth Exploration of MARL Algorithms (mentioning)
confidence: 99%
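The credit assignment the excerpt above refers to is handled in COMA by a counterfactual baseline: each agent's advantage compares the Q-value of the joint action actually taken with an expectation over that agent's alternative actions, holding the other agents' actions fixed. The sketch below illustrates this for a single agent; the function and argument names are hypothetical, and it assumes the centralised critic already exposes Q-values for every alternative action of that agent.

```python
# Sketch of a COMA-style counterfactual advantage for one agent.
# Assumes q_values_for_agent[u'] = Q(s, (u^{-a}, u')) with the other agents'
# actions fixed; names are illustrative, not from the paper's code.
import torch


def counterfactual_advantage(
    q_values_for_agent: torch.Tensor,  # shape (n_actions,)
    policy_probs: torch.Tensor,        # shape (n_actions,): pi^a(. | tau^a)
    taken_action: int,
) -> torch.Tensor:
    # Counterfactual baseline: marginalise out this agent's own action under its policy.
    baseline = torch.dot(policy_probs, q_values_for_agent)
    # Advantage of the action actually taken relative to that baseline.
    return q_values_for_agent[taken_action] - baseline
```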
“…A branch of methods investigates factorizable Q-functions (Sunehag et al., 2017; Rashid et al., 2018; Mahajan et al., 2019; Son et al., 2019), where the team Q is decomposed into individual utility functions. Other methods adopt the actor-critic approach, where only the critic is centralized (Foerster et al., 2017; Lowe et al., 2017). However, most CTDE methods structurally require fixed-size teams and are often applied to homogeneous teams.…”
Section: Centralized Training with Decentralized Execution (mentioning)
confidence: 99%
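For the factorisation family mentioned in this excerpt, the simplest instance is a VDN-style additive decomposition, where the team value is the sum of per-agent utilities. A minimal sketch under that assumption (shapes and names are illustrative):

```python
# VDN-style additive value decomposition: Q_team = sum_a Q_a(tau_a, u_a).
import torch


def team_q_vdn(per_agent_utilities: torch.Tensor) -> torch.Tensor:
    # per_agent_utilities: shape (batch, n_agents); entry [b, a] is Q_a(tau_a, u_a).
    # Summing preserves per-agent argmaxes, so each agent can act greedily on its
    # own utility at execution time while training uses the team value.
    return per_agent_utilities.sum(dim=-1)
```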
“…In contrast, a total optimization objective value cannot be directly applied as the joint reward $r^{p,\mathrm{jo}}_{m,t}$ of each PA-agent, for two reasons: 1) a global reward makes it difficult for each agent to deduce its individual contribution, since the gradient computed for each actor does not explicitly reason about how that agent's actions contribute to the global reward [35]; and 2) different users (i.e., different agents) can have different weights to account for different priorities. The final reward of PA-agent $m$ is $r^{p}_{m,t} = r^{p,\mathrm{int}}_{m,t} + r^{p,\mathrm{jo}}_{m,t}$.…”
Section: Reward of the SA Subtask (mentioning)
confidence: 99%
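A toy illustration of the per-agent reward combination in the excerpt above: each PA-agent receives its individual term plus a share of the joint term. The per-agent weighting of the joint term is an assumption added here to reflect the excerpt's remark about different priorities; it is not the cited paper's exact formulation.

```python
# Toy sketch: r^p_{m,t} = r^{p,int}_{m,t} + w_m * r^{p,jo}_{m,t}
# (the weights w_m are an illustrative assumption, not from the cited paper).
from typing import Sequence


def combined_rewards(
    individual: Sequence[float],  # r^{p,int}_{m,t} for each agent m
    joint: float,                 # shared joint-objective term
    weights: Sequence[float],     # assumed per-agent priority weights
) -> list[float]:
    return [r_int + w * joint for r_int, w in zip(individual, weights)]
```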