Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. Yet, so far, such methods have been mostly analyzed from an optimization perspective, without addressing the problem of exploration, or by making strong assumptions on the interaction with the environment. In this paper we consider model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback. For this setting, we propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde{O}(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards. Furthermore, we prove $\tilde{O}(\sqrt{S^2 A H^4}\, K^{2/3})$ regret for adversarial rewards. Interestingly, this result matches previous bounds derived for the bandit feedback case, yet with known transitions. To the best of our knowledge, these two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
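To make the regret statements above concrete, the following is a minimal, standard formalization of episodic regret; the notation $\pi_k$, $V^{\pi}_1$, $s_1$, $r_k$ is generic shorthand assumed here for illustration, not quoted from the paper. The learner plays policy $\pi_k$ in episode $k$ and is compared against the best fixed policy:
\[
  \mathrm{Regret}(K) \;=\; \max_{\pi} \sum_{k=1}^{K} \Bigl( V^{\pi}_1(s_1; r_k) \;-\; V^{\pi_k}_1(s_1; r_k) \Bigr),
\]
where $V^{\pi}_1(s_1; r_k)$ is the expected cumulative reward of policy $\pi$ over the $H$ steps of episode $k$, starting from the initial state $s_1$, and $\tilde{O}(\cdot)$ hides polylogarithmic factors in $S$, $A$, $H$, and $K$.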
Reinforcement learning typically assumes that the agent observes feedback from the environment immediately, but in many real-world applications (like recommendation systems) the feedback is observed with a delay. Thus, we consider online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs and unrestricted delayed feedback. That is, the costs and trajectory of episode $k$ are only available at the end of episode $k + d_k$, where the delays $d_k$ are neither identical nor bounded, and are chosen by an adversary. We present novel algorithms based on policy optimization that achieve near-optimal high-probability regret of $O(\sqrt{K} + \sqrt{D})$ under full-information feedback, where $K$ is the number of episodes and $D = \sum_{k} d_k$ is the total delay. Under bandit feedback, we prove a similar $O(\sqrt{K} + \sqrt{D})$ regret assuming that the costs are stochastic, and $O(K^{2/3} + D^{2/3})$ regret in the general case. To our knowledge, we are the first to consider the important setting of delayed feedback in adversarial MDPs.
The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed with a delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only at the end of episode $k + d_k$, where the delay $d_k$ can change across episodes and is chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^{K} d_k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.
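For the two delayed-feedback abstracts above, the following spells out what information is available to the learner when it selects its policy; the index set $\mathcal{F}_k$ is illustrative notation assumed here, not taken from the papers:
\[
  \pi_k \;\text{is chosen based on the feedback of episodes}\;\; \mathcal{F}_k \;=\; \{\, j : j + d_j \le k - 1 \,\},
  \qquad D \;=\; \sum_{k=1}^{K} d_k .
\]
That is, when episode $k$ begins, the learner has observed only the trajectories and costs of those episodes $j$ whose feedback was released by the end of episode $j + d_j \le k - 1$; taking $d_k \equiv 0$ recovers the standard non-delayed setting.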
We study the Stochastic Shortest Path (SSP) problem, in which an agent has to reach a goal state at minimum total expected cost. In the learning formulation of the problem, the agent has no prior knowledge about the costs and dynamics of the model. She repeatedly interacts with the model for $K$ episodes, and has to learn to approximate the optimal policy as closely as possible. In this work we show that the minimax regret for this setting is $\tilde{O}(B_\star \sqrt{|S| |A| K})$, where $B_\star$ is a bound on the expected cost of the optimal policy from any state, $S$ is the state space, and $A$ is the action space. This matches the lower bound of Rosenberg et al. (2020) up to logarithmic factors, and improves their regret bound by a factor of $\sqrt{|S|}$. Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs. To that end, we provide an algorithm for the finite-horizon setting whose leading term in the regret depends only logarithmically on the horizon, yielding the same regret guarantees for SSP.
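As a brief clarification of the quantities in this bound, SSP regret compares the learner's accumulated cost to $K$ times the optimal expected cost-to-go; the symbols $J^{\pi}$, $c_i^k$, $I_k$, and $s_{\mathrm{init}}$ below are standard shorthand assumed for illustration rather than quoted from the paper:
\[
  \mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \sum_{i=1}^{I_k} c_i^k \;-\; K \cdot \min_{\pi} J^{\pi}(s_{\mathrm{init}}),
  \qquad
  B_\star \;=\; \max_{s} J^{\pi^\star}(s),
\]
where $I_k$ is the number of steps the agent takes to reach the goal in episode $k$, $c_i^k$ is the cost incurred at step $i$ of episode $k$, and $J^{\pi}(s)$ is the expected total cost of following policy $\pi$ from state $s$ until the goal is reached.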
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state at minimum total expected cost. In this paper we present the adversarial SSP model, which also accounts for adversarial changes in the costs over time, while the underlying transition function remains unchanged. Formally, an agent interacts with an SSP environment for $K$ episodes, the cost function changes arbitrarily between episodes, and the transitions are unknown to the agent. We develop the first algorithms for adversarial SSPs and prove high-probability regret bounds of $\sqrt{K}$ assuming all costs are strictly positive, and sub-linear regret in the general case. We are the first to consider this natural setting of adversarial SSP and to obtain sub-linear regret for it.