2020
DOI: 10.48550/arxiv.2010.03152
Preprint

Projection-Based Constrained Policy Optimization

Abstract: We consider the problem of learning control policies that optimize a reward function while satisfying constraints due to considerations of safety, fairness, or other costs. We propose a new algorithm, Projection-Based Constrained Policy Optimization (PCPO). This is an iterative method for optimizing policies in a two-step process: the first step performs a local reward improvement update, while the second step reconciles any constraint violation by projecting the policy back onto the constraint set. We theoret…
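To make the two-step structure described in the abstract concrete, here is a deliberately simplified sketch in Python. It is not the paper's algorithm: PCPO carries out both steps within a trust region and defines the projection under a KL-divergence or L2 metric, whereas this toy version uses plain gradient ascent on the policy parameters and a Euclidean projection onto a linearized cost constraint. All function names, arguments, and the learning rate are illustrative assumptions.

```python
# Simplified illustration of the "reward step, then projection" idea.
# NOT the paper's exact method: PCPO uses trust-region updates and
# KL/L2 projections; this sketch uses gradient ascent + a Euclidean
# projection onto a linearized (half-space) cost constraint.
import numpy as np

def pcpo_style_update(theta, reward_grad, cost_grad, cost_value, cost_limit, lr=0.01):
    """One simplified update on policy parameters `theta`.

    reward_grad : gradient of expected reward w.r.t. theta
    cost_grad   : gradient of expected cost w.r.t. theta
    cost_value  : current expected cost J_C(theta)
    cost_limit  : constraint threshold d (we require J_C <= d)
    """
    # Step 1: local reward improvement (plain gradient ascent here).
    theta_new = theta + lr * reward_grad

    # Step 2: project back onto the constraint set if violated.
    # Linearized constraint: J_C(theta) + cost_grad^T (theta_new - theta) <= cost_limit.
    violation = cost_value + cost_grad @ (theta_new - theta) - cost_limit
    if violation > 0:
        # Closed-form Euclidean projection onto the violated half-space.
        theta_new = theta_new - (violation / (cost_grad @ cost_grad + 1e-8)) * cost_grad
    return theta_new
```

The closed-form half-space projection in step 2 is what keeps the correction step cheap relative to re-solving a full constrained optimization problem at every iteration.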

Cited by 25 publications (36 citation statements). References 9 publications.
“…Safe RL [26], [28], [2], [35], [36] extends RL by adding constraints on the expectation of certain cost functions, which encode safety requirements or resource limits. CPO [26] derived a policy improvement step that increases the reward while satisfying the safety constraint.…”
Section: B. Safe Reinforcement Learning
Mentioning confidence: 99%
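For readers unfamiliar with the setup these citation statements refer to, the constrained MDP (CMDP) objective that CPO-style methods address is conventionally written as below; the notation (discount factor, cost threshold) follows the usual conventions and is not copied from this particular paper.

```latex
% Standard CMDP formulation: maximize expected discounted reward
% subject to a bound on expected discounted cost.
\begin{aligned}
\max_{\pi} \quad & J_R(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\Big] \\
\text{s.t.} \quad & J_C(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t} C(s_t, a_t)\Big] \le d
\end{aligned}
```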
“…They are the simplest methods to address CMDPs, can easily be combined with existing policy gradient methods for solving regular MDPs, and have been shown to lead to good-performing feasible policies at convergence. Projection-based methods (Achiam et al. 2017; Chow et al. 2019; Yang et al. 2020; Zhang, Vuong, and Ross 2020) instead use a projection step to try to map the policy back into a feasible region after the reward maximisation step. While they may reduce the number of constraint violations, they generally come at the cost of additional complexity.…”
Section: Related Work: Constrained Reinforcement Learning
Mentioning confidence: 99%
“…Such modularization essentially divides Problem (3) into an unconstrained optimization problem and a constraint satisfaction problem. This "divide and conquer" method not only enables simple end-to-end training, but also avoids the heavy computation needed to solve complex constrained optimization problems, which is inevitable in previous solution methods for constrained Markov decision processes [11, 18-20]. Furthermore, as shown in Figure 1(b), DeCOM incorporates the gradients from N_i to update f_i, because gradient sharing among agents could facilitate agent cooperation, as shown by recent studies [21, 22].…”
Section: DeCOM Framework
Mentioning confidence: 99%
“…A wide variety of constrained reinforcement learning frameworks have been proposed to solve constrained MDPs (CMDPs) [43]. They either convert a CMDP into an unconstrained min-max problem by introducing Lagrangian multipliers [12, 14, 44-48], or seek to obtain the optimal policy by directly solving constrained optimization problems [11, 13, 18-20, 49-51]. However, it is hard to scale these single-agent methods to our multi-agent setting due to computational inefficiency.…”
Section: Related Work
Mentioning confidence: 99%
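The Lagrangian route mentioned in this last statement can be summarized in a few lines. The sketch below shows only the generic primal-dual structure of relaxing J_C(π) ≤ d with a multiplier λ; the gradient estimates, step sizes, and function names are placeholders, not any specific cited algorithm.

```python
# Minimal sketch of Lagrangian relaxation for a CMDP: the constrained problem
#   max_theta J_R(theta)  s.t.  J_C(theta) <= d
# becomes the unconstrained min-max problem
#   min_{lambda >= 0} max_theta  J_R(theta) - lambda * (J_C(theta) - d).
import numpy as np

def lagrangian_updates(theta, lam, reward_grad, cost_grad, cost_value,
                       cost_limit, lr_theta=1e-2, lr_lam=1e-2):
    """One primal-dual step on (theta, lambda) for the relaxed objective."""
    # Primal ascent on theta: gradient of J_R - lambda * J_C.
    theta = theta + lr_theta * (reward_grad - lam * cost_grad)
    # Dual ascent on lambda: raise the penalty while the constraint is violated,
    # then project back onto lambda >= 0.
    lam = max(0.0, lam + lr_lam * (cost_value - cost_limit))
    return theta, lam
```

The max(0, ·) on the multiplier is the projection step of standard projected dual ascent, which keeps λ non-negative throughout training.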