Markov Decision Processes with Arbitrary Reward Processes

Yu, Jia Yuan; Mannor, Shie; Shimkin, Nahum

doi:10.1287/moor.1090.0397

Cited by 74 publications

(90 citation statements)

References 19 publications

“…Dekel and Hazan [9] prove a matching upper bound, which implies that the (undiscounted) minimax regret of the AD-MDP problem is Θ(T^{2/3}). The AD-MDP setting belongs to the more general class of adversarial MDPs with bandit feedback [17,15], where the state transitions are allowed to be stochastic. This implies an Ω(T^{2/3}) lower bound on the (undiscounted) minimax regret of the general setting.…”

Section: Lower Bound For Online Adversarial Markov Decision Processes (mentioning)

confidence: 99%

Bandits with switching costs

Dekel, Ding, Koren, et al. (2014)

Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing

113

We study the adversarial multi-armed bandit problem in a setting where the player incurs a unit cost each time he switches actions. We prove that the player's T-round minimax regret in this setting is Θ(T^{2/3}), thereby closing a fundamental gap in our understanding of learning with bandit feedback. In the corresponding full-information version of the problem, the minimax regret is known to grow at a much slower rate of Θ(√T). The difference between these two rates provides the first indication that learning with bandit feedback can be significantly harder than learning with full-information feedback (previous results only showed a different dependence on the number of actions, but not on T).

In addition to characterizing the inherent difficulty of the multi-armed bandit problem with switching costs, our results also resolve several other open problems in online learning. One direct implication is that learning with bandit feedback against bounded-memory adaptive adversaries has a minimax regret of Θ(T^{2/3}). Another implication is that the minimax regret of online learning in adversarial Markov decision processes (MDPs) is Θ(T^{2/3}). The key to all of our results is a new randomized construction of a multi-scale random walk, which is of independent interest and likely to prove useful in additional settings.
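For readers who want the quantity behind these rates written out, the following is a sketch of the regret with switching costs in LaTeX. The notation is an assumption and is not taken from the abstract: K arms, loss functions \ell_t chosen in advance by the adversary with values in [0,1], and x_t the arm played in round t.

```latex
% Assumed notation (not from the abstract): K arms, losses \ell_t : [K] \to [0,1],
% x_t = arm played in round t, unit cost for each switch of arm.
\[
  R_T \;=\; \mathbb{E}\!\left[\sum_{t=1}^{T} \ell_t(x_t)
        \;+\; \sum_{t=2}^{T} \mathbb{1}\{x_t \neq x_{t-1}\}\right]
        \;-\; \min_{x \in [K]} \sum_{t=1}^{T} \ell_t(x)
\]
% Rates stated in the abstract:
%   bandit feedback with switching costs:  R_T = \Theta(T^{2/3})
%   full-information counterpart:          R_T = \Theta(\sqrt{T})
```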

“…They studied two possible models therein: one where the transition matrix is stationary, chosen once by Nature at the beginning, and a second where Nature chooses at each timestep a matrix from the set. For this setting, they show how to compute the optimal policy using linear programming. Following our initial publication, Yu et al. [20] studied a similar model where the transition matrix is known and stationary and the rewards are chosen by an adversary. Their algorithm is based on following the perturbed leader and is computationally more efficient.…”

Section: Related Work (mentioning)

confidence: 99%
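The statement above says that the algorithm of Yu et al. is based on following the perturbed leader. As a rough illustration of that idea only, here is a minimal Python sketch of follow-the-perturbed-leader in the plain experts setting, not in an MDP. It is not the algorithm of Yu et al.; the function name, the `scale` parameter, and the choice of exponential perturbations are all assumptions made for this example.

```python
import random


def follow_the_perturbed_leader(loss_rounds, scale, seed=0):
    """Generic follow-the-perturbed-leader over n experts (illustrative only).

    loss_rounds: list of rounds, each a list of n losses in [0, 1].
    scale: mean of the exponential perturbation (assumed parameter, must be > 0).
    Returns (choices, learner_loss).
    """
    rng = random.Random(seed)
    n = len(loss_rounds[0])
    cumulative = [0.0] * n   # cumulative loss of each expert so far
    choices = []
    learner_loss = 0.0
    for losses in loss_rounds:
        # Draw a fresh perturbation per expert and pick the expert whose
        # perturbed cumulative loss is smallest (the "perturbed leader").
        perturbed = [cumulative[i] - rng.expovariate(1.0 / scale) for i in range(n)]
        choice = min(range(n), key=lambda i: perturbed[i])
        choices.append(choice)
        learner_loss += losses[choice]
        # Full-information feedback: all losses are revealed after the choice.
        for i in range(n):
            cumulative[i] += losses[i]
    return choices, learner_loss


if __name__ == "__main__":
    # Expert 0 always has loss 0, expert 1 always has loss 1.
    rounds = [[0.0, 1.0] for _ in range(200)]
    picks, total = follow_the_perturbed_leader(rounds, scale=5.0)
    print("total loss:", total)
```

A larger `scale` makes the choice more random and less sensitive to recent losses; a very small `scale` reduces the method to simply following the current leader.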

Online Markov Decision Processes

2009

We consider a Markov decision process (MDP) setting in which the reward function is allowed to change after each time step (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well an agent can do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions.

1. Introduction. Finite state and actions Markov decision processes (MDPs) are a popular and attractive way to formulate many stochastic optimization problems ranging from robotics to finance (Puterman [17], Bertsekas and Tsitsiklis [2], Sutton and Barto [18]). Unfortunately, in many applications the Markovian assumption made is only a relaxation of the real model. A popular framework that is not Markovian is the experts problem, in which during every round a learner chooses one of n decision-making experts and incurs the loss of the chosen expert. The setting is typically an adversarial one, where Nature provides the examples to a learner. The standard objective here is a myopic, backwards-looking one: in retrospect, we desire that our performance is not much worse than had we chosen any single expert on the sequence of examples provided by Nature. Expert algorithms have played an important role in computer science in the past decade, solving problems varying from classification to online portfolios (see Littlestone and Warmuth [13], Blum and Kalai [3], Helmbold et al. [8]).

There is an inherent tension between the objectives in an expert setting and those in a reinforcement learning (RL) setting. In contrast to the myopic nature of the expert algorithms, an RL setting typically makes the much stronger assumption of a fixed environment, and the forward-looking objective is to maximize some measure of the future reward with respect to this fixed environment. Therefore, in RL the past actions have a major influence on the current reward, whereas in the regret setting they have no influence. In this paper, we relax the Markovian assumption of the MDPs by letting the reward function be time dependent, and even chosen by an adversary as is done in the expert setting, but still keeping the underlying structure of an MDP.

The motivation of this work is to understand how to efficiently incorporate the benefits of existing experts' algorithms into a more adversarial reinforcement learning setting, where certain aspects of the environment could change over time. A naive way to implement an experts' algorithm is to simply associate an expert with each fixed policy. The running time of such algorithms is polynomial in the number of experts, and the regret (the difference from the optimal reward) is logarithmic in the number of experts. For our setting, the number of policies is huge, namely, for an MDP with state space S and action space A we have A^S polic...
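The experts setting and the naive "one expert per policy" idea described in this abstract can be sketched in Python as follows. This is an illustration under assumptions, not the algorithm of the cited paper: it uses the standard exponential-weights (Hedge) update with losses in [0, 1], and the function names `hedge` and `num_deterministic_policies` are hypothetical.

```python
import math
import random


def hedge(loss_rounds, eta):
    """Exponential-weights (Hedge) forecaster over n experts.

    loss_rounds: list of rounds, each a list of n losses in [0, 1].
    eta: learning rate (assumed parameter).
    Returns (expected learner loss, cumulative loss of the best single expert).
    """
    n = len(loss_rounds[0])
    weights = [1.0] * n
    cumulative = [0.0] * n
    learner_loss = 0.0
    for losses in loss_rounds:
        total = sum(weights)
        probs = [w / total for w in weights]
        # Expected loss when an expert is sampled according to probs.
        learner_loss += sum(p * l for p, l in zip(probs, losses))
        for i, l in enumerate(losses):
            cumulative[i] += l
            weights[i] *= math.exp(-eta * l)  # penalize experts with high loss
    return learner_loss, min(cumulative)


def num_deterministic_policies(num_states, num_actions):
    """Number of deterministic stationary policies: one action per state."""
    return num_actions ** num_states


if __name__ == "__main__":
    rng = random.Random(1)
    T, n = 500, 4
    rounds = [[rng.random() for _ in range(n)] for _ in range(T)]
    eta = math.sqrt(8.0 * math.log(n) / T)
    learner, best = hedge(rounds, eta)
    print("regret:", learner - best)
    print("policies for 10 states, 3 actions:", num_deterministic_policies(10, 3))
```

The per-round cost of `hedge` grows linearly with the number of experts, so treating every deterministic policy as an expert quickly becomes impractical: even 10 states and 3 actions already give 59049 policies.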

“…Firstly, the known regret-minimizing techniques do not apply directly because some sequences of transition functions may prevent ergodicity or create periodicity, which causes non-vanishing average regret [9]. Secondly, whereas it is possible to obtain asymptotically vanishing average regret when only the reward functions change arbitrarily, if both the transition probabilities and rewards change arbitrarily, as in our model, it is NP-hard to compute a policy that comes close to the best stationary policy [3].…”

Section: Hardness (mentioning)

confidence: 96%

“…Suppose that the agent takes the transition model of Figure 1(a) as the nominal model and adopts a regret minimizing policy for MDPs (based on combining expert advice or perturbing optimal policies [3], [9]) based on this nominal model. After a long time, this policy assigns a high probability to taking the 'left' action.…”

Section: Examples (mentioning)

confidence: 99%

“…These notions have been studied separately. Online learning in MDPs makes the solution robust against arbitrary variation in the reward functions when the transition probabilities are fixed [3], [9]. Robust dynamic programming has been used to control MDPs where the transition probabilities may vary arbitrarily, but where the reward functions may not [6].…”

Section: Introduction (mentioning)

confidence: 99%

Online learning in Markov decision processes with arbitrarily changing rewards and transitions

Mannor

2009

2009 International Conference on Game Theory for Networks

Self Cite

We consider decision-making problems in Markov decision processes where both the rewards and the transition probabilities vary in an arbitrary (e.g., non-stationary) fashion. We present algorithms that combine online learning and robust control, and establish guarantees on their performance evaluated in retrospect against alternative policies, i.e., their regret. These guarantees depend critically on the range of uncertainty in the transition probabilities, but hold regardless of the changes in rewards and transition probabilities over time. We present a version of the main algorithm in the setting where the decision-maker's observations are limited to its trajectory, and another version that allows a trade-off between performance and computational complexity.
