We consider the problem of controlling a Linear Quadratic Regulator (LQR) system over a finite horizon $T$ with fixed and known cost matrices $Q, R$, but unknown and non-stationary dynamics $\{A_t, B_t\}$. The sequence of dynamics matrices can be arbitrary, but with a total variation $V_T$ that is assumed to be $o(T)$ and unknown to the controller. Under the assumption that a sequence of stabilizing, but potentially sub-optimal, controllers is available for all $t$, we present an algorithm that achieves the optimal dynamic regret of $O\left(V_T^{2/5} T^{3/5}\right)$. With piecewise-constant dynamics, our algorithm achieves the optimal regret of $O(\sqrt{ST})$, where $S$ is the number of switches. The crux of our algorithm is an adaptive non-stationarity detection strategy, which builds on an approach recently developed for contextual multi-armed bandit problems. We also argue that non-adaptive forgetting (e.g., restarting, or sliding-window learning with a static window size) may not be regret-optimal for the LQR problem, even when the window size is optimally tuned with knowledge of $V_T$. The main technical challenge in the analysis of our algorithm is to prove that the ordinary least squares (OLS) estimator has a small bias when the parameter to be estimated is non-stationary. Our analysis also highlights that the key motif driving the regret is that the LQR problem is, in spirit, a bandit problem with linear feedback and locally quadratic cost. This motif is more universal than the LQR problem itself, and we therefore believe our results should find wider application.
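For concreteness, a standard formulation of the dynamic regret benchmark in this setting (our illustrative notation, not necessarily the paper's exact benchmark) compares the learner's cumulative cost against the trajectory $(x_t^*, u_t^*)$ generated by playing, at each time $t$, the optimal controller for the instantaneous dynamics $(A_t, B_t)$:
\[
\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \left( x_t^\top Q\, x_t + u_t^\top R\, u_t \right) \;-\; \sum_{t=1}^{T} \left( (x_t^{*})^\top Q\, x_t^{*} + (u_t^{*})^\top R\, u_t^{*} \right).
\]
Under this reading, the bounds above say the excess cost grows as $O\left(V_T^{2/5} T^{3/5}\right)$ in general, which is sublinear in $T$ whenever $V_T = o(T)$, and as $O(\sqrt{ST})$ when the dynamics change only at $S$ switch points.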