Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. Yet, so far, such methods have been mostly analyzed from an optimization perspective, without addressing the problem of exploration, or by making strong assumptions on the interaction with the environment. In this paper we consider model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback. For this setting, we propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde{O}(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards. Furthermore, we prove $\tilde{O}(\sqrt{S^2 A H^4}\, K^{2/3})$ regret for adversarial rewards. Interestingly, this result matches previous bounds derived for the bandit feedback case, yet with known transitions. To the best of our knowledge, these two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
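To make the regret statements above concrete, the following is a minimal, standard formalization of episodic regret; the notation $\pi_k$, $V^{\pi}_1$, $s_1$, $r_k$ is generic shorthand assumed here for illustration, not quoted from the paper. The learner plays policy $\pi_k$ in episode $k$ and is compared against the best fixed policy:
\[
  \mathrm{Regret}(K) \;=\; \max_{\pi} \sum_{k=1}^{K} \Bigl( V^{\pi}_1(s_1; r_k) \;-\; V^{\pi_k}_1(s_1; r_k) \Bigr),
\]
where $V^{\pi}_1(s_1; r_k)$ is the expected cumulative reward of policy $\pi$ over the $H$ steps of episode $k$, starting from the initial state $s_1$, and $\tilde{O}(\cdot)$ hides polylogarithmic factors in $S$, $A$, $H$, and $K$.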
Reinforcement learning typically assumes that the agent observes feedback from the environment immediately, but in many real-world applications (like recommendation systems) the feedback is observed with a delay. Thus, we consider online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs and unrestricted delayed feedback. That is, the costs and trajectory of episode $k$ are only available at the end of episode $k + d_k$, where the delays $d_k$ are neither identical nor bounded, and are chosen by an adversary. We present novel algorithms based on policy optimization that achieve near-optimal high-probability regret of $O(\sqrt{K} + \sqrt{D})$ under full-information feedback, where $K$ is the number of episodes and $D = \sum_{k} d_k$ is the total delay. Under bandit feedback, we prove a similar $O(\sqrt{K} + \sqrt{D})$ regret assuming that the costs are stochastic, and $O(K^{2/3} + D^{2/3})$ regret in the general case. To our knowledge, we are the first to consider the important setting of delayed feedback in adversarial MDPs.
The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed with a delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only at the end of episode $k + d_k$, where the delay $d_k$ can change across episodes and is chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^{K} d_k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.
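For the two delayed-feedback abstracts above, the following spells out what information is available to the learner when it selects its policy; the index set $\mathcal{F}_k$ is illustrative notation assumed here, not taken from the papers:
\[
  \pi_k \;\text{is chosen based on the feedback of episodes}\;\; \mathcal{F}_k \;=\; \{\, j : j + d_j \le k - 1 \,\},
  \qquad D \;=\; \sum_{k=1}^{K} d_k .
\]
That is, when episode $k$ begins, the learner has observed only the trajectories and costs of those episodes $j$ whose feedback was released by the end of episode $j + d_j \le k - 1$; taking $d_k \equiv 0$ recovers the standard non-delayed setting.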
We study the Stochastic Shortest Path (SSP) problem, in which an agent has to reach a goal state at minimum total expected cost. In the learning formulation of the problem, the agent has no prior knowledge about the costs and dynamics of the model. She repeatedly interacts with the model for $K$ episodes, and has to learn to approximate the optimal policy as closely as possible. In this work we show that the minimax regret for this setting is $\tilde{O}(B_\star \sqrt{|S| |A| K})$, where $B_\star$ is a bound on the expected cost of the optimal policy from any state, $S$ is the state space, and $A$ is the action space. This matches the lower bound of Rosenberg et al. (2020) up to logarithmic factors, and improves their regret bound by a factor of $\sqrt{|S|}$. Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs. To that end, we provide an algorithm for the finite-horizon setting whose leading term in the regret depends only logarithmically on the horizon, yielding the same regret guarantees for SSP.
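As a brief clarification of the quantities in this bound, SSP regret compares the learner's accumulated cost to $K$ times the optimal expected cost-to-go; the symbols $J^{\pi}$, $c_i^k$, $I_k$, and $s_{\mathrm{init}}$ below are standard shorthand assumed for illustration rather than quoted from the paper:
\[
  \mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \sum_{i=1}^{I_k} c_i^k \;-\; K \cdot \min_{\pi} J^{\pi}(s_{\mathrm{init}}),
  \qquad
  B_\star \;=\; \max_{s} J^{\pi^\star}(s),
\]
where $I_k$ is the number of steps the agent takes to reach the goal in episode $k$, $c_i^k$ is the cost incurred at step $i$ of episode $k$, and $J^{\pi}(s)$ is the expected total cost of following policy $\pi$ from state $s$ until the goal is reached.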
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state at minimum total expected cost. In this paper we present the adversarial SSP model, which also accounts for adversarial changes in the costs over time, while the underlying transition function remains unchanged. Formally, an agent interacts with an SSP environment for $K$ episodes, the cost function changes arbitrarily between episodes, and the transitions are unknown to the agent. We develop the first algorithms for adversarial SSPs and prove high-probability regret bounds of $\sqrt{K}$ assuming all costs are strictly positive, and sub-linear regret in the general case. We are the first to consider this natural setting of adversarial SSP and to obtain sub-linear regret for it.