2020
DOI: 10.48550/arxiv.2010.12870
Preprint

Efficient Learning in Non-Stationary Linear Markov Decision Processes

Abstract: We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov Decision Processes (MDPs), i.e., both the reward and the transition kernel are linear with respect to a given feature map and are allowed to evolve either slowly or abruptly over time. For this problem setting, we propose OPT-WLSVI, an optimistic model-free algorithm based on weighted least squares value iteration which uses exponential weights to smoothly forget data that are far in the past. We show that our algorithm, when …
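The core mechanism named in the abstract, weighted least squares value iteration with exponential weights, amounts to an exponentially weighted ridge regression for the Q-function parameters in a linear MDP. The following is a minimal Python sketch of that regression step only; the optimism bonus and the exact weighting schedule of OPT-WLSVI are omitted, and the function name and parameters are illustrative rather than taken from the paper.

```python
import numpy as np

def weighted_lsvi_update(features, targets, forget=0.99, reg=1.0):
    """Exponentially weighted ridge regression (sketch of the weighted LSVI step).

    features: (t, d) array of feature vectors phi(s_tau, a_tau), oldest first.
    targets:  (t,) array of regression targets, e.g. r_tau + max_a Q_next(s'_tau, a).
    forget:   forgetting factor in (0, 1]; sample tau gets weight forget**(t-1-tau),
              so data far in the past is smoothly forgotten.
    reg:      ridge regularisation strength.
    Returns w such that Q(s, a) is approximated by phi(s, a) @ w.
    """
    t, d = features.shape
    # most recent sample gets weight 1, older samples decay geometrically
    w_exp = forget ** np.arange(t - 1, -1, -1)
    gram = reg * np.eye(d) + (features * w_exp[:, None]).T @ features
    moment = features.T @ (w_exp * targets)
    return np.linalg.solve(gram, moment)

# Tiny usage example with synthetic data (d = 4 features, 50 past transitions).
rng = np.random.default_rng(0)
phi = rng.normal(size=(50, 4))
y = phi @ np.array([1.0, -0.5, 0.2, 0.0]) + 0.1 * rng.normal(size=50)
print(weighted_lsvi_update(phi, y, forget=0.95))
```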

Cited by 9 publications (20 citation statements)
References 10 publications
“…This matches the existing bounds in the general non-stationary linear kernel MDP setting without any constraints (Zhong et al., 2021; Zhou et al., 2020; Touati & Vincent, 2020). The dependence on the variation budgets (B ∆ , B ⋆ ) also matches the existing bound for the policy-based method in the non-stationary linear kernel MDP setting (Zhong et al., 2021), but is worse than the Q-learning based methods (Zhou et al., 2020; Touati & Vincent, 2020). However, Q-learning based methods cannot solve CMDPs since the optimal solution of a CMDP is usually a stochastic policy.…”
Section: Results (supporting)
Confidence: 88%
“…Periodically restarted optimistic policy evaluation. Besides the restart strategy (Mao et al., 2020; Zhou et al., 2020) used to adapt to non-stationarity in the policy evaluation step, other strategies following the forgetting principle, such as sliding windows (Zhong et al., 2021; Cheung et al., 2020) or exponentially decayed weights (Touati & Vincent, 2020), are also expected to work in our framework.…”
Section: Discussion (mentioning)
Confidence: 99%
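As a rough illustration (not taken from any of the cited papers), the three forgetting schemes mentioned in the statement above — periodic restarts, sliding windows, and exponentially decayed weights — can all be viewed as different per-sample weight schedules over past data. The function and parameter names below are illustrative only.

```python
import numpy as np

def forgetting_weights(t, strategy="exponential", window=100, restart_at=0, forget=0.99):
    """Per-sample weights over t past samples (index 0 = oldest) under each scheme."""
    idx = np.arange(t)
    if strategy == "restart":
        # periodic restart: only data collected since the last restart is used
        return (idx >= restart_at).astype(float)
    if strategy == "sliding_window":
        # sliding window: only the most recent `window` samples are used
        return (idx >= t - window).astype(float)
    if strategy == "exponential":
        # exponential decay: older samples are smoothly down-weighted
        return forget ** (t - 1 - idx)
    raise ValueError(strategy)

# e.g. weights over 5 past samples at time t = 5
print(forgetting_weights(5, strategy="sliding_window", window=3))  # [0. 0. 1. 1. 1.]
print(forgetting_weights(5, strategy="exponential", forget=0.5))   # [0.0625 0.125 0.25 0.5 1.]
```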
“…Although there is a huge body of literature on developing provably efficient RL methods, most of the existing works focus on the classical stationary setting, with a few exceptions including Jaksch et al. (2010), Gajane et al. (2018), Cheung et al. (2019a, 2019c, 2020), and Touati and Vincent (2020). However, these works all focus on value-based methods which only output greedy policies, and mostly focus on the tabular case where the state space is finite.…”
Section: Introduction (mentioning)
Confidence: 99%