2020
DOI: 10.48550/arxiv.2010.03152
Preprint

Projection-Based Constrained Policy Optimization

Abstract: We consider the problem of learning control policies that optimize a reward function while satisfying constraints due to considerations of safety, fairness, or other costs. We propose a new algorithm, Projection-Based Constrained Policy Optimization (PCPO). This is an iterative method for optimizing policies in a two-step process: the first step performs a local reward improvement update, while the second step reconciles any constraint violation by projecting the policy back onto the constraint set. We theoret…
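To make the two-step structure described in the abstract concrete, here is a deliberately simplified sketch in Python. It is not the paper's algorithm: PCPO carries out both steps within a trust region and defines the projection under a KL-divergence or L2 metric, whereas this toy version uses plain gradient ascent on the policy parameters and a Euclidean projection onto a linearized cost constraint. All function names, arguments, and the learning rate are illustrative assumptions.

```python
# Simplified illustration of the "reward step, then projection" idea.
# NOT the paper's exact method: PCPO uses trust-region updates and
# KL/L2 projections; this sketch uses gradient ascent + a Euclidean
# projection onto a linearized (half-space) cost constraint.
import numpy as np

def pcpo_style_update(theta, reward_grad, cost_grad, cost_value, cost_limit, lr=0.01):
    """One simplified update on policy parameters `theta`.

    reward_grad : gradient of expected reward w.r.t. theta
    cost_grad   : gradient of expected cost w.r.t. theta
    cost_value  : current expected cost J_C(theta)
    cost_limit  : constraint threshold d (we require J_C <= d)
    """
    # Step 1: local reward improvement (plain gradient ascent here).
    theta_new = theta + lr * reward_grad

    # Step 2: project back onto the constraint set if violated.
    # Linearized constraint: J_C(theta) + cost_grad^T (theta_new - theta) <= cost_limit.
    violation = cost_value + cost_grad @ (theta_new - theta) - cost_limit
    if violation > 0:
        # Closed-form Euclidean projection onto the violated half-space.
        theta_new = theta_new - (violation / (cost_grad @ cost_grad + 1e-8)) * cost_grad
    return theta_new
```

The closed-form half-space projection in step 2 is what keeps the correction step cheap relative to re-solving a full constrained optimization problem at every iteration.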

Cited by 25 publications (36 citation statements). References 9 publications.
“…Safe RL [26], [28], [2], [35], [36] extends RL by adding constraints on the expectation of certain cost functions, which encode safety requirements or resource limits. CPO [26] derived a policy improvement step that increases the reward while satisfying the safety constraint.…”
Section: B. Safe Reinforcement Learning
Mentioning confidence: 99%
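For readers unfamiliar with the setup these citation statements refer to, the constrained MDP (CMDP) objective that CPO-style methods address is conventionally written as below; the notation (discount factor, cost threshold) follows the usual conventions and is not copied from this particular paper.

```latex
% Standard CMDP formulation: maximize expected discounted reward
% subject to a bound on expected discounted cost.
\begin{aligned}
\max_{\pi} \quad & J_R(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\Big] \\
\text{s.t.} \quad & J_C(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t} C(s_t, a_t)\Big] \le d
\end{aligned}
```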
“…They are the simplest methods to address CMDPs, can easily be combined with existing policy gradient methods for solving regular MDPs, and have been shown to lead to good-performing feasible policies at convergence. Projection-based methods (Achiam et al. 2017; Chow et al. 2019; Yang et al. 2020; Zhang, Vuong, and Ross 2020) instead use a projection step to try to map the policy back into a feasible region after the reward maximisation step. While they may reduce the number of constraint violations, they generally come at the cost of additional complexity.…”
Section: Related Work: Constrained Reinforcement Learning
Mentioning confidence: 99%
“…Such modularization essentially divides Problem (3) into an unconstrained optimization problem and a constraint satisfaction problem. This "divide and conquer" method not only enables simple end-to-end training, but also avoids the heavy computation needed to solve complex constrained optimization problems, which is inevitable in previous solution methods for constrained Markov decision processes [11, 18-20]. Furthermore, as shown in Figure 1(b), DeCOM incorporates the gradients from N_i to update f_i, because gradient sharing among agents could facilitate agent cooperation, as shown by recent studies [21, 22].…”
Section: DeCOM Framework
Mentioning confidence: 99%
“…A wide variety of constrained reinforcement learning frameworks have been proposed to solve constrained MDPs (CMDPs) [43]. They either convert a CMDP into an unconstrained min-max problem by introducing Lagrangian multipliers [12, 14, 44-48], or seek to obtain the optimal policy by directly solving constrained optimization problems [11, 13, 18-20, 49-51]. However, it is hard to scale these single-agent methods to our multi-agent setting due to computational inefficiency.…”
Section: Related Work
Mentioning confidence: 99%
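The Lagrangian route mentioned in this last statement can be summarized in a few lines. The sketch below shows only the generic primal-dual structure of relaxing J_C(π) ≤ d with a multiplier λ; the gradient estimates, step sizes, and function names are placeholders, not any specific cited algorithm.

```python
# Minimal sketch of Lagrangian relaxation for a CMDP: the constrained problem
#   max_theta J_R(theta)  s.t.  J_C(theta) <= d
# becomes the unconstrained min-max problem
#   min_{lambda >= 0} max_theta  J_R(theta) - lambda * (J_C(theta) - d).
import numpy as np

def lagrangian_updates(theta, lam, reward_grad, cost_grad, cost_value,
                       cost_limit, lr_theta=1e-2, lr_lam=1e-2):
    """One primal-dual step on (theta, lambda) for the relaxed objective."""
    # Primal ascent on theta: gradient of J_R - lambda * J_C.
    theta = theta + lr_theta * (reward_grad - lam * cost_grad)
    # Dual ascent on lambda: raise the penalty while the constraint is violated,
    # then project back onto lambda >= 0.
    lam = max(0.0, lam + lr_lam * (cost_value - cost_limit))
    return theta, lam
```

The max(0, ·) on the multiplier is the projection step of standard projected dual ascent, which keeps λ non-negative throughout training.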