In many real-world reinforcement learning (RL) problems, in addition to maximizing the objective, the learning agent has to maintain some necessary safety constraints. We formulate the problem of learning a safe policy as an infinite-horizon discounted Constrained Markov Decision Process (CMDP) with an unknown transition probability matrix, where the safety requirements are modeled as constraints on expected cumulative costs. We propose two model-based constrained reinforcement learning (CRL) algorithms for learning a safe policy, namely, (i) the GM-CRL algorithm, where the algorithm has access to a generative model, and (ii) the UC-CRL algorithm, where the algorithm learns the model using an upper-confidence-style online exploration method. We characterize the sample complexity of these algorithms, i.e., the number of samples needed to ensure a desired level of accuracy with high probability, both with respect to objective maximization and constraint satisfaction.
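For concreteness, a standard formalization of the discounted CMDP described in this abstract is sketched below; the symbols γ, r, c, and C̄ are assumed notation, not taken from the paper:

\[
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right] \le \bar{C},
\]

where r is the reward, c is the safety cost, γ ∈ (0,1) is the discount factor, and C̄ is the constraint budget. The transition kernel P(s′ | s, a) is unknown and must be estimated from samples, via a generative model in GM-CRL or through online exploration in UC-CRL.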
Connected automated vehicles (CAVs) could potentially be coordinated to safely attain the maximum traffic flow on roadways under dynamic traffic patterns, such as those engendered by the merger of two strings of vehicles due to a lane drop. Strings of vehicles have to be shaped correctly in terms of inter-vehicular time-gap and velocity to ensure that such operation is feasible. However, controllers that can achieve such traffic shaping over the multiple dimensions of target time-gap and velocity over a region of space are unknown. The objective of this work is to design such a controller, and to show that candidate time-gap and velocity profiles can be designed so that the controller stabilizes the string of vehicles in attaining the target profiles. Our analysis is based on studying the system in the spatial rather than the time domain, which enables us to study stability in terms of minimizing errors from the target profile and across vehicles as a function of location. Finally, we conduct numerical simulations in the context of shaping two platoons for merger, which we use to illustrate how to select time-gap and velocity profiles that maximize flow while maintaining safety.
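As a rough guide to how such profiles trade off flow and safety (this is a textbook steady-state relation under a constant-time-gap spacing policy, not a result from the paper), a string tracking a time-gap profile h*(x) and velocity profile v*(x), with effective vehicle length L, achieves a local flow of approximately

\[
q(x) \approx \frac{v^{*}(x)}{h^{*}(x)\, v^{*}(x) + L},
\]

so raising the target velocity or shrinking the target time-gap increases flow, while larger time-gaps leave more slack for safe merging.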
We consider the problem of learning an episodic safe control policy that minimizes an objective function, while satisfying necessary safety constraints, both during learning and deployment. We formulate this safety-constrained reinforcement learning (RL) problem using the framework of a finite-horizon Constrained Markov Decision Process (CMDP) with an unknown transition probability function. Here, we model the safety requirements as constraints on the expected cumulative costs that must be satisfied during all episodes of learning. We propose a model-based safe RL algorithm that we call the Optimistic-Pessimistic Safe Reinforcement Learning (OPSRL) algorithm, and show that it achieves an Õ(S²√(AH⁷K)/(C̄ − C̄_b)) cumulative regret without violating the safety constraints during learning, where S is the number of states, A is the number of actions, H is the horizon length, K is the number of learning episodes, and (C̄ − C̄_b) is the safety gap, i.e., the difference between the constraint value and the cost of a known safe baseline policy. The scaling as Õ(√K) is the same as in the traditional approach where constraints may be violated during learning, which means that our algorithm suffers no additional regret in spite of providing a safety guarantee. Our key idea is to use an optimistic exploration approach with pessimistic constraint enforcement for learning the policy. This approach simultaneously incentivizes the exploration of unknown states while imposing a penalty for visiting states that are likely to cause violation of safety constraints. We validate our algorithm by evaluating its performance on benchmark problems against conventional approaches.
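A minimal sketch of the optimistic-pessimistic idea, using a generic UCB-style exploration bonus b(s, a) that shrinks with the visit count N(s, a) (the exact bonus and constants here are assumptions, not the paper's): in each episode the algorithm plans with the modified estimates

\[
\tilde{r}(s,a) = \hat{r}(s,a) + b(s,a), \qquad \tilde{c}(s,a) = \hat{c}(s,a) + b(s,a),
\]

so the empirical reward is inflated (optimism, encouraging exploration of rarely visited state-action pairs) while the empirical cost is also inflated (pessimism, discouraging visits that could violate the safety constraint), and the known safe baseline policy can be mixed in whenever the resulting constrained planning problem would otherwise be infeasible.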