Safe Policies for Reinforcement Learning via Primal-Dual Methods

Paternain, Santiago; Calvo-Fullana, Miguel; Chamon, Luiz F. O.; Ribeiro, Alejandro

doi:10.48550/arxiv.1911.09101

Cited by 13 publications

(28 citation statements)

References 32 publications

“…CMDP. The study of RL algorithms for CMDPs has received considerable attention due to the safety requirement (Altman, 1999; Paternain et al., 2019; Yu et al., 2019; Dulac-Arnold et al., 2019; García & Fernández, 2015). Our work is closely related to Lagrangian-based CMDP algorithms with optimistic policy evaluations (Efroni et al., 2020; Singh et al., 2020; Ding et al., 2021; Liu et al., 2021; Qiu et al., 2020).…”

Section: Related Work (mentioning)

confidence: 99%

Provably Efficient Primal-Dual Reinforcement Learning for CMDPs with Non-stationary Objectives and Constraints

Ding, Lavaei

2022

Preprint

We consider primal-dual-based reinforcement learning (RL) in episodic constrained Markov decision processes (CMDPs) with non-stationary objectives and constraints, which play a central role in ensuring the safety of RL in time-varying environments. In this problem, the reward/utility functions and the state transition functions are both allowed to vary arbitrarily over time as long as their cumulative variations do not exceed certain known variation budgets. Designing safe RL algorithms in time-varying environments is particularly challenging because of the need to integrate the constraint violation reduction, safe exploration, and adaptation to the non-stationarity. To this end, we propose a Periodically Restarted Optimistic Primal-Dual Proximal Policy Optimization (PROPD-PPO) algorithm that features three mechanisms: periodic-restart-based policy improvement, dual update with dual regularization, and periodic-restart-based optimistic policy evaluation. We establish a dynamic regret bound and a constraint violation bound for the proposed algorithm in both the linear kernel CMDP function approximation setting and the tabular CMDP setting. This paper provides the first provably efficient algorithm for non-stationary CMDPs with safe exploration.
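For orientation, the Lagrangian-based view of a CMDP that these citing works refer to can be written generically as below. This is a sketch in assumed notation (V_r for the reward value, V_g for the utility value, b for the constraint threshold, λ for the dual variable); it is not taken from any of the listed papers and omits the restarts, optimism, and dual regularization described in the abstract above.

```latex
% Generic CMDP and its saddle-point form (assumed notation, for illustration only)
\begin{aligned}
&\max_{\pi}\; V_r(\pi) \quad \text{subject to} \quad V_g(\pi) \ge b, \\
&\mathcal{L}(\pi,\lambda) = V_r(\pi) + \lambda\,\bigl(V_g(\pi) - b\bigr), \qquad \lambda \ge 0, \\
&\max_{\pi}\,\min_{\lambda \ge 0}\; \mathcal{L}(\pi,\lambda).
\end{aligned}
```

A primal-dual method alternates an ascent step on the policy with a projected descent step on λ.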

“…Constrained RL: Several policy-gradient algorithms have seen success in practice [29,27,24,19,1,32]. Also of interest are works which utilize Gaussian processes to model the transition probabilities and value functions [5,30,18,7].…”

Section: Related Work (mentioning)

confidence: 99%

“…Several policy-gradient-based algorithms have been proposed to solve CMDPs. Lagrangian-based methods [29,27,24,19] formulate the CMDP problem as a saddle-point problem and optimize it via primal-dual methods, while Constrained Policy Optimization [1,32] (inspired by the trust region policy optimization [26]) computes new dual variables from scratch at each update to maintain constraints during learning. Although these algorithms provide ways to learn an optimal policy, performance guarantees about reward regret, safety violation or sample complexity are rare.…”

Section: Introduction (mentioning)

confidence: 99%

Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs

Liu, Zhou, Kalathil, et al.

2021

Preprint

We address the issue of safety in reinforcement learning. We pose the problem in an episodic framework of a constrained Markov decision process. Existing results have shown that it is possible to achieve a reward regret of Õ(√K) while allowing an Õ(√K) constraint violation in K episodes. A critical question that arises is whether it is possible to keep the constraint violation even smaller. We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability while keeping the reward regret of order Õ(√K). The algorithm which does so employs the principle of optimistic pessimism in the face of uncertainty to achieve safe exploration. When no strictly safe policy is known, though one is known to exist, then it is possible to restrict the system to bounded constraint violation with arbitrarily high probability. This is shown to be realized by a primal-dual algorithm with an optimistic primal estimate and a pessimistic dual update.
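The citation statements above describe Lagrangian-based methods that treat the CMDP as a saddle-point problem and solve it with primal-dual updates. A minimal sketch of that generic idea follows, on a hypothetical single-state CMDP (a constrained bandit) with exact gradients. The function names, step sizes, and toy rewards are assumptions made for illustration; this is not the optimistic/pessimistic algorithm of the paper above, and it carries none of its guarantees.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over action preferences."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def primal_dual(r, g, b, steps=5000, eta_theta=0.1, eta_lam=0.05):
    """Generic primal-dual iteration on a one-state CMDP (illustrative only).

    Maximizes pi @ r subject to pi @ g >= b using the Lagrangian
    L(pi, lam) = pi @ r + lam * (pi @ g - b) with lam >= 0.
    """
    r = np.asarray(r, dtype=float)
    g = np.asarray(g, dtype=float)
    theta = np.zeros_like(r)   # softmax policy parameters
    lam = 0.0                  # dual variable
    avg_pi = np.zeros_like(r)  # running average of the policy iterates
    for _ in range(steps):
        pi = softmax(theta)
        c = r + lam * g
        # Primal step: exact gradient of pi @ c with respect to theta.
        theta = theta + eta_theta * pi * (c - pi @ c)
        # Dual step: gradient descent on lam, projected onto lam >= 0.
        lam = max(0.0, lam - eta_lam * (pi @ g - b))
        avg_pi += pi / steps
    return softmax(theta), lam, avg_pi

# Hypothetical toy instance: three actions, utility threshold b = 0.5.
pi_last, lam, pi_avg = primal_dual(r=[1.0, 0.6, 0.0], g=[0.0, 0.8, 1.0], b=0.5)
print("last policy:", pi_last, "dual variable:", lam, "averaged policy:", pi_avg)
```

The last iterate of such a scheme can oscillate around the constraint boundary, which is one reason the sketch also returns the averaged policy. It is consistent with the remark in the statements above that performance guarantees for these methods are rare.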

“…Moreover, the actor-critic approach is complicated and, therefore, it is difficult to characterize the convergence rate. Though a primal-dual method with a good policy approximation is shown to converge to a neighborhood of the global optimum, the resulting policy may not even satisfy the constraint [25], [27]. A notable exception is the finite CMDP where the primal-dual methods can ensure the converge to an optimal policy [20], [28] and a sublinear convergence rate has been evaluated in [28].…”

Section: Introduction (mentioning)

confidence: 99%

Global Convergence of Policy Gradient Primal-dual Methods for Risk-constrained LQRs

Zhao, You, Başar

2021

Preprint

While the techniques in optimal control theory are often model-based, the policy optimization (PO) approach can directly optimize the performance metric of interest without explicit dynamical models, and is an essential approach for reinforcement learning problems. However, it usually leads to a non-convex optimization problem in most cases, where there is little theoretical understanding on its performance. In this paper, we focus on the risk-constrained Linear Quadratic Regulator (LQR) problem with noisy input via the PO approach, which results in a challenging non-convex problem. To this end, we first build on our earlier result that the optimal policy has an affine structure to show that the associated Lagrangian function is locally gradient dominated with respect to the policy, based on which we establish strong duality. Then, we design policy gradient primal-dual methods with global convergence guarantees to find an optimal policy-multiplier pair in both model-based and sample-based settings. Finally, we use samples of system trajectories in simulations to validate our policy gradient primal-dual methods.
