2021
DOI: 10.48550/arxiv.2110.08984
Preprint

Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs

Abstract: We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs). In this setting, both the reward function and the transition kernel are linear with respect to the given feature maps and are allowed to vary over time, as long as their respective parameter variations do not exceed certain variation budgets. We propose the periodically restarted optimistic policy optimization algorithm (PROPO), which is an optimistic policy optimization algorithm with linear function…
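The abstract only sketches PROPO at a high level. As a rough, hypothetical illustration of the periodic-restart idea it builds on (not the paper's actual algorithm; the names `PeriodicRestartWrapper`, `make_learner`, `run_episode`, and `restart_period` are invented for this sketch), a generic optimistic policy-optimization learner can simply be re-instantiated every fixed number of episodes so that data gathered under drifted rewards or transitions is discarded:

```python
# Illustrative sketch of periodic restarting (hypothetical names throughout);
# this is NOT the paper's PROPO implementation, only the restart-scheduling idea.

class PeriodicRestartWrapper:
    def __init__(self, make_learner, restart_period):
        self.make_learner = make_learner      # factory returning a fresh optimistic policy-optimization learner
        self.restart_period = restart_period  # episodes between restarts, tuned to the variation budget
        self.learner = make_learner()

    def run(self, env, num_episodes):
        total_return = 0.0
        for k in range(num_episodes):
            # Discard all accumulated statistics so that data collected under an
            # older (drifted) reward/transition parameter does not bias the estimates.
            if k > 0 and k % self.restart_period == 0:
                self.learner = self.make_learner()
            total_return += self.learner.run_episode(env)  # one episode of learning
        return total_return
```

In this sketch the restart period plays the role of a tuning knob: shorter periods track faster drift but waste more data, which is the trade-off the variation budget controls.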

Cited by 3 publications (8 citation statements) · References 45 publications
“…When the variation budget is known a priori, Fei et al. (2020) propose the first policy-based method for non-stationary RL, but they assume stationary transitions and adversarial full-information rewards in the tabular setting. Zhong, Yang, and Szepesvári (2021) extend the above results to a more general setting where both the transitions and rewards can vary over episodes. To eliminate the assumption of having prior knowledge of variation budgets, Wei and Luo (2021) recently outline that an adaptive restart approach can be used to convert any upper-confidence-bound-type stationary RL algorithm into a dynamic-regret-minimizing algorithm.…”
Section: Related Work (supporting)
confidence: 62%
“…Non-stationary RL. Non-stationary RL has been mostly studied in the unconstrained setting (Jaksch, Ortner, and Auer 2010; Auer, Gajane, and Ortner 2019; Ortner, Gajane, and Auer 2020; Domingues et al. 2021; Mao et al. 2020; Zhou et al. 2020; Touati and Vincent 2020; Fei et al. 2020; Zhong, Yang, and Szepesvári 2021; Cheung, Simchi-Levi, and Zhu 2020; Wei and Luo 2021). Our work is related to policy-based methods for non-stationary RL, since the optimal solution of a CMDP is usually a stochastic policy (Altman 1999) and thus a policy-based method is preferred.…”
Section: Related Work (mentioning)
confidence: 99%
“…Our work contributes to the theoretical investigations of policy-based methods in RL (Cai et al., 2020; Shani et al., 2020; Lancewicki et al., 2020; Fei et al., 2020; He et al., 2021; Zhong et al., 2021; Luo et al., 2021; Zanette et al., 2021). The most closely related policy-based method is proposed by Shani et al. (2020), who also study episodic tabular MDPs with unknown transitions, stochastic losses, and bandit feedback.…”
Section: Related Work (mentioning)
confidence: 88%
“…Non-stationary RL has been mostly studied in the risk-neutral setting. When the variation budget is known a priori, a common strategy for adapting to the non-stationarity is to follow the forgetting principle, for example via a restart strategy (Mao et al. 2020; Zhou et al. 2020; Zhao et al. 2020; Ding and Lavaei 2022), exponentially decayed weights (Touati and Vincent 2020), or a sliding window (Cheung, Simchi-Levi, and Zhu 2020; Zhong, Yang, and Szepesvári 2021). In this work, we focus on the restart method, mainly because of its simplicity and memory efficiency (Zhao et al. 2020), and generalize it to the risk-sensitive RL setting.…”
Section: Related Work (mentioning)
confidence: 99%
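The statement above contrasts restarts, exponentially decayed weights, and sliding windows as forgetting mechanisms. As a minimal, hypothetical sketch (not taken from any of the cited papers; the function name, the window length `window`, and the ridge parameter `lam` are all illustrative), a sliding-window ridge estimator for a drifting linear reward parameter could look as follows:

```python
import numpy as np

def sliding_window_ridge(features, rewards, window, lam=1.0):
    """Ridge-regression estimate of a linear reward parameter using only the
    most recent `window` samples; older samples are forgotten entirely, which
    is what lets the estimate track a slowly drifting parameter."""
    X = np.asarray(features[-window:], dtype=float)  # W most recent feature vectors, shape (W, d)
    y = np.asarray(rewards[-window:], dtype=float)   # matching observed rewards, shape (W,)
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)                    # regularized Gram matrix
    return np.linalg.solve(A, X.T @ y)               # estimated parameter, shape (d,)

# Example: the estimate follows a parameter that drifts halfway through the data.
rng = np.random.default_rng(0)
phi = rng.normal(size=(200, 3))
theta = np.where(np.arange(200)[:, None] < 100, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
r = np.sum(phi * theta, axis=1) + 0.01 * rng.normal(size=200)
print(sliding_window_ridge(phi, r, window=50))  # close to [0, 1, 0], the recent parameter
```

A restart corresponds to throwing away all samples at fixed checkpoints, whereas the sliding window discards them gradually; both trade estimation accuracy against responsiveness to drift.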