2018
DOI: 10.48550/arxiv.1805.10066
Preprint

A Sliding-Window Algorithm for Markov Decision Processes with Arbitrarily Changing Rewards and Transitions

Abstract: We consider reinforcement learning in changing Markov Decision Processes where both the state-transition probabilities and the reward functions may vary over time. For this problem setting, we propose an algorithm using a sliding window approach and provide performance guarantees for the regret evaluated against the optimal non-stationary policy. We also characterize the optimal window size suitable for our algorithm. These results are complemented by a sample complexity bound on the number of sub-optimal step…
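For orientation, the regret "evaluated against the optimal non-stationary policy" mentioned in the abstract can be formalized as sketched below; the symbols ρ*_t (the optimal per-step gain of the MDP in force at step t) and r_t (the reward actually collected) are illustrative assumptions, not notation taken from the paper.

```latex
% A plausible formalization of dynamic regret over a horizon T, assuming
% \rho^*_t is the optimal average reward (gain) of the MDP active at step t
% and r_t is the reward collected by the learner; notation is illustrative only.
\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \rho^*_t \;-\; \sum_{t=1}^{T} r_t
```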

Cited by 9 publications (17 citation statements)
References 6 publications
“…These bounds are known to be optimal even when L and ∆ are known, and they improve over [Cheung et al., 2019] for linear bandits, [Mao et al., 2020] for episodic tabular MDPs, and [Touati and Vincent, 2020] for episodic linear MDPs. For infinite-horizon MDPs, we achieve the same optimal regret when the maximum diameter of the MDPs is known, or when L and ∆ are known, improving over the best existing results by [Gajane et al., 2018] and [Cheung et al., 2020]. When none of them is known, we can still adopt the BoRL technique [Cheung et al., 2020] with the price of paying extra T^{3/4} regret, which is suboptimal but still outperforms best known results.…”
Section: Setting
confidence: 71%
“…In particular, we emphasize that achieving dynamic regret Reg⋆_L beyond (contextual) multi-armed bandits is one notable breakthrough we make. Indeed, even when L is known, previous approaches based on restarting after a fixed period, a sliding window with a fixed size, or discounting with a fixed discount factor, all lead to a suboptimal bound of O(L^{1/3} T^{2/3}) at best [Gajane et al., 2018]. Since this bound is subsumed by Reg⋆_∆, related discussions are also often omitted in previous works.…”
Section: Setting
confidence: 99%
“…Reinforcement learning for non-stationary MDPs: While there is some literature on regret minimization for MDPs with a fixed transition kernel but a changing sequence of cost functions [Yu et al., 2009, Ortner et al., 2020], the work on unknown non-stationary dynamics is much more recent [Gajane et al., 2018, Cheung et al., 2019b]. The main idea is to use sliding window based estimators of the transition kernel, and design a policy based on an optimistic model of the transition dynamics within the confidence set.…”
Section: Related Work
confidence: 99%
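The quotation above describes the core mechanism shared by this line of work: estimate the transition kernel and rewards from a sliding window of recent data, then plan optimistically within the resulting confidence set. A minimal sketch of the estimation step is given below, assuming a tabular MDP with n_states states and n_actions actions; the function name, the window length, and the Hoeffding-style confidence radius are illustrative assumptions rather than the construction from the paper.

```python
from collections import deque
import numpy as np

def sliding_window_estimates(history, n_states, n_actions, window, delta=0.05):
    """Empirical model built from only the last `window` transitions.

    `history` is an iterable of (s, a, r, s_next) tuples in time order.
    Returns (p_hat, r_hat, conf), where conf[s, a] is a Hoeffding-style
    confidence radius that shrinks with the in-window visit count.
    Illustrative sketch only; not the exact estimator from the paper.
    """
    recent = deque(history, maxlen=window)       # discard everything older than the window
    counts = np.zeros((n_states, n_actions))
    trans = np.zeros((n_states, n_actions, n_states))
    rew_sum = np.zeros((n_states, n_actions))

    for s, a, r, s_next in recent:
        counts[s, a] += 1
        trans[s, a, s_next] += 1
        rew_sum[s, a] += r

    n = np.maximum(counts, 1)                    # avoid division by zero
    p_hat = trans / n[:, :, None]                # empirical transition kernel
    r_hat = rew_sum / n                          # empirical mean rewards
    conf = np.sqrt(np.log(2 * n_states * n_actions * window / delta) / n)
    return p_hat, r_hat, conf
```

Discarding data outside the window is what lets the estimates track abrupt changes in the dynamics, at the cost of higher variance; an optimistic planner (e.g., UCRL-style extended value iteration) then searches over all models within the confidence radius of p_hat.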
“…Although there is a huge body of literature on developing provably efficient RL methods, most of the existing works focus on the classical stationary setting, with a few exceptions including Jaksch et al. (2010); Gajane et al. (2018); Cheung et al. (2019a,c, 2020); Touati and Vincent (2020). However, these works all focus on value-based methods which only output greedy policies, and mostly focus on the tabular case where the state space is finite.…”
Section: Introduction
confidence: 99%