2022 IEEE 32nd International Workshop on Machine Learning for Signal Processing (MLSP)
DOI: 10.1109/mlsp55214.2022.9943500

Data-Driven Robust Multi-Agent Reinforcement Learning

Abstract: Robust Markov decision processes (MDPs) address the challenge of model uncertainty by optimizing the worst-case performance over an uncertainty set of MDPs. In this paper, we focus on robust average-reward MDPs under the model-free setting. We first theoretically characterize the structure of solutions to the robust average-reward Bellman equation, which is essential for our later convergence analysis. We then design two model-free algorithms, robust relative value iteration (RVI) TD and robust RVI Q-learning…
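The abstract names the relative value iteration (RVI) style of update but gives no pseudocode. Below is a minimal tabular sketch of an RVI-style Q-learning loop for average-reward RL, written only to illustrate the general update form: the Gymnasium-style environment interface, the reference offset f(Q) = Q[ref_state, ref_action], and the crude worst-case penalty `kappa` are all assumptions for illustration and are not the paper's robust algorithm.

```python
import numpy as np

def rvi_q_learning_sketch(env, num_steps=50_000, alpha=0.1, kappa=0.1,
                          ref_state=0, ref_action=0, seed=0):
    """Illustrative RVI-style Q-learning for average-reward RL (hedged sketch).

    NOTE: not the algorithm from the paper.
    - f(Q) = Q[ref_state, ref_action] is one common RVI offset choice.
    - `kappa` crudely stands in for a worst-case (robust) penalty on the
      next-state value; the paper defines this via an uncertainty set.
    - Assumes a Gymnasium-style discrete environment (env.reset / env.step).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    s, _ = env.reset(seed=seed)
    for _ in range(num_steps):
        # epsilon-greedy behavior policy
        a = env.action_space.sample() if rng.random() < 0.1 else int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        # pessimistic next-state value: best action value minus a robustness penalty
        v_next = np.max(Q[s_next]) - kappa * np.abs(np.max(Q[s_next]))
        # RVI update: subtract the reference value f(Q) instead of discounting
        f_q = Q[ref_state, ref_action]
        Q[s, a] += alpha * (r - f_q + v_next - Q[s, a])
        s = s_next
        if terminated or truncated:
            s, _ = env.reset()
    return Q
```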

Cited by 6 publications (9 citation statements)
References 39 publications
“…(2) The Monte Carlo method assumes that the value of each state equals the average of the returns G_t over multiple episodes, each of which must run to a terminal state [100]. The value function of a state is the expected return; under the Monte Carlo assumption, this expectation is simplified to the sample mean.…”
Section: Model-free Reinforcement Learning (mentioning)
confidence: 99%
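The excerpt describes Monte Carlo value estimation, where V(s) is approximated by the sample mean of returns G_t collected from complete episodes. A minimal every-visit sketch is below; the episode data format (a list of (state, reward) pairs per terminated trajectory) is an assumption for illustration.

```python
from collections import defaultdict

def monte_carlo_value(episodes, gamma=1.0):
    """Every-visit Monte Carlo: V(s) is the average of returns G_t observed at s.

    `episodes` is assumed to be a list of [(state, reward), ...] trajectories,
    each of which has run to termination; this interface is illustrative only.
    """
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # walk the episode backwards to accumulate the return G_t at each step
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns_sum[state] += g
            returns_cnt[state] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```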
“…According to the characteristics and requirements of the problem, choosing a suitable centralized reinforcement learning method can improve the learning effect and decision quality of the agent. Common algorithms include Q-learning, DQNs (deep Q-networks) [127], policy gradient methods [128], proximal policy optimization, etc. Q-learning is a basic centralized reinforcement learning method that makes optimal decisions by learning a value function.…”
Section: Concentrated Reinforcement Learning (mentioning)
confidence: 99%
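The excerpt lists Q-learning as the basic value-based method in this family. For concreteness, a minimal tabular Q-learning sketch is given below; the Gymnasium-style environment interface and the hyperparameter defaults are assumptions, not values from the cited work.

```python
import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular Q-learning sketch: learn Q(s, a) from sampled transitions.

    Assumes a Gymnasium-style discrete environment (env.reset / env.step);
    hyperparameters are illustrative defaults, not tuned values.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            a = env.action_space.sample() if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # standard Q-learning target: r + gamma * max_a' Q(s', a')
            target = r + gamma * np.max(Q[s_next]) * (not terminated)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```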
“…On regularizing the learning process, Kumar et al [20,22] introduce Q-learning and policy gradient methods for L_p uncertainty sets, but do not evaluate their methods experimentally. Another type of uncertainty set considered in online robust RL is R-contamination, for which previous works have derived a robust Q-learning algorithm [40] and a regularized policy gradient algorithm [41]. R-contamination assumes that the adversary can take the agent to any state, which is too conservative in practice.…”
Section: Related Work (mentioning)
confidence: 99%
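Under R-contamination, the worst-case kernel follows the nominal transition with probability (1 - R) and an arbitrary adversarial one with probability R, so the robust backup mixes the observed next-state value with the globally worst one, which is why the excerpt calls it conservative. The sketch below shows one way such a robust target could modify the Q-learning backup; it follows the standard R-contamination robust Bellman form and is not taken verbatim from the cited works.

```python
import numpy as np

def r_contamination_target(Q, reward, s_next, gamma=0.99, R=0.1):
    """Illustrative robust Q-learning target under R-contamination.

    With probability (1 - R) the nominal next state s_next is used; with
    probability R the adversary may move the agent anywhere, so the target
    falls back to the worst best-action value over all states.
    """
    nominal_value = np.max(Q[s_next])            # value at the observed next state
    worst_value = np.min(np.max(Q, axis=1))      # worst state's best-action value
    return reward + gamma * ((1.0 - R) * nominal_value + R * worst_value)
```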
“…Specifically, model-based methods that solve RMDPs [3,7,9,14,21,44] require access to the nominal transition probability, making it difficult to scale beyond tabular settings. While some recent works [21,22,42,43] introduce model-free methods that add regularization to the learning process, the effectiveness of their methods is not validated in high-dimensional environments. In addition, these methods are based on particular RL algorithms (e.g., policy gradient, Q-learning), limiting their general applicability.…”
Section: Introduction (mentioning)
confidence: 99%
“…The PPO algorithm (Yang et al, 2018) is a reinforcement learning algorithm based on the policy gradient method. It samples data through interaction with the environment and optimizes a surrogate (“alternative”) objective function using stochastic gradient ascent.…”
Section: Comparison Of Various Algorithms (mentioning)
confidence: 99%
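The surrogate objective the excerpt refers to is PPO's clipped objective, which bounds the importance ratio between the new and old policies so each stochastic-gradient-ascent step stays close to the data-collecting policy. A minimal sketch follows; the tensor layout (1-D batch tensors) and the clipping constant 0.2 are illustrative assumptions.

```python
import torch

def ppo_clipped_objective(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective from PPO (to be maximized by gradient ascent).

    ratio = pi_new(a|s) / pi_old(a|s); clipping the ratio keeps the policy
    update conservative. Inputs are 1-D tensors over sampled (s, a) pairs.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # elementwise minimum gives the pessimistic (clipped) surrogate
    return torch.min(unclipped, clipped).mean()
```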