2018
DOI: 10.48550/arxiv.1805.10066
Preprint

A Sliding-Window Algorithm for Markov Decision Processes with Arbitrarily Changing Rewards and Transitions

Abstract: We consider reinforcement learning in changing Markov Decision Processes where both the state-transition probabilities and the reward functions may vary over time. For this problem setting, we propose an algorithm using a sliding window approach and provide performance guarantees for the regret evaluated against the optimal non-stationary policy. We also characterize the optimal window size suitable for our algorithm. These results are complemented by a sample complexity bound on the number of sub-optimal step…
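For orientation, the regret "evaluated against the optimal non-stationary policy" mentioned in the abstract can be formalized as sketched below; the symbols ρ*_t (the optimal per-step gain of the MDP in force at step t) and r_t (the reward actually collected) are illustrative assumptions, not notation taken from the paper.

```latex
% A plausible formalization of dynamic regret over a horizon T, assuming
% \rho^*_t is the optimal average reward (gain) of the MDP active at step t
% and r_t is the reward collected by the learner; notation is illustrative only.
\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \rho^*_t \;-\; \sum_{t=1}^{T} r_t
```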

Cited by 9 publications (17 citation statements)
References 6 publications
“…These bounds are known to be optimal even when L and ∆ are known, and they improve over [Cheung et al., 2019] for linear bandits, [Mao et al., 2020] for episodic tabular MDPs, and [Touati and Vincent, 2020] for episodic linear MDPs. For infinite-horizon MDPs, we achieve the same optimal regret when the maximum diameter of the MDPs is known, or when L and ∆ are known, improving over the best existing results by [Gajane et al., 2018] and [Cheung et al., 2020]. When none of them is known, we can still adopt the BoRL technique [Cheung et al., 2020] with the price of paying extra T^{3/4} regret, which is suboptimal but still outperforms best known results.…”
Section: Setting
confidence: 71%
“…In particular, we emphasize that achieving dynamic regret Reg⋆_L beyond (contextual) multi-armed bandits is one notable breakthrough we make. Indeed, even when L is known, previous approaches based on restarting after a fixed period, a sliding window with a fixed size, or discounting with a fixed discount factor, all lead to a suboptimal bound of O(L^{1/3} T^{2/3}) at best [Gajane et al., 2018]. Since this bound is subsumed by Reg⋆_∆, related discussions are also often omitted in previous works.…”
Section: Setting
confidence: 99%
“…Reinforcement learning for non-stationary MDPs: While there is some literature on regret minimization for MDPs with a fixed transition kernel but a changing sequence of cost functions [Yu et al., 2009, Ortner et al., 2020], the work on unknown non-stationary dynamics is much more recent [Gajane et al., 2018, Cheung et al., 2019b]. The main idea is to use sliding window based estimators of the transition kernel, and design a policy based on an optimistic model of the transition dynamics within the confidence set.…”
Section: Related Work
confidence: 99%
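The quotation above describes the core mechanism shared by this line of work: estimate the transition kernel and rewards from a sliding window of recent data, then plan optimistically within the resulting confidence set. A minimal sketch of the estimation step is given below, assuming a tabular MDP with n_states states and n_actions actions; the function name, the window length, and the Hoeffding-style confidence radius are illustrative assumptions rather than the construction from the paper.

```python
from collections import deque
import numpy as np

def sliding_window_estimates(history, n_states, n_actions, window, delta=0.05):
    """Empirical model built from only the last `window` transitions.

    `history` is an iterable of (s, a, r, s_next) tuples in time order.
    Returns (p_hat, r_hat, conf), where conf[s, a] is a Hoeffding-style
    confidence radius that shrinks with the in-window visit count.
    Illustrative sketch only; not the exact estimator from the paper.
    """
    recent = deque(history, maxlen=window)       # discard everything older than the window
    counts = np.zeros((n_states, n_actions))
    trans = np.zeros((n_states, n_actions, n_states))
    rew_sum = np.zeros((n_states, n_actions))

    for s, a, r, s_next in recent:
        counts[s, a] += 1
        trans[s, a, s_next] += 1
        rew_sum[s, a] += r

    n = np.maximum(counts, 1)                    # avoid division by zero
    p_hat = trans / n[:, :, None]                # empirical transition kernel
    r_hat = rew_sum / n                          # empirical mean rewards
    conf = np.sqrt(np.log(2 * n_states * n_actions * window / delta) / n)
    return p_hat, r_hat, conf
```

Discarding data outside the window is what lets the estimates track abrupt changes in the dynamics, at the cost of higher variance; an optimistic planner (e.g., UCRL-style extended value iteration) then searches over all models within the confidence radius of p_hat.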
“…Although there is a huge body of literature on developing provably efficient RL methods, most of the existing works focus on the classical stationary setting, with a few exceptions including Jaksch et al. (2010); Gajane et al. (2018); Cheung et al. (2019a,c, 2020); Touati and Vincent (2020). However, these works all focus on value-based methods which only output greedy policies, and mostly focus on the tabular case where the state space is finite.…”
Section: Introduction
confidence: 99%