2009
DOI: 10.1287/moor.1090.0397

Markov Decision Processes with Arbitrary Reward Processes

Abstract: We consider a learning problem where the decision maker interacts with a standard Markov decision process, with the exception that the reward functions vary arbitrarily over time. We show that, against every possible realization of the reward process, the agent can perform as well—in hindsight—as every stationary policy. This generalizes the classical no-regret result for repeated games. Specifically, we present an efficient online algorithm—in the spirit of reinforcement learning—that ensures that the agent's…
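As an illustrative restatement of the guarantee described in the abstract (a sketch under assumed notation, not the paper's own statement): writing r_t for the reward function chosen at time t, (s_t, a_t) for the learner's state-action pair, s_t^π for the trajectory induced by a stationary policy π, and Π for the set of stationary policies, the no-regret property can be written as

```latex
% Hedged sketch: regret against the best stationary policy vanishes on average.
% All notation (r_t, s_t, a_t, \Pi) is assumed for illustration.
\[
  \mathrm{Regret}_T
    \;=\; \max_{\pi \in \Pi}
      \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\bigl(s_t^{\pi}, \pi(s_t^{\pi})\bigr)\right]
      \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t(s_t, a_t)\right],
  \qquad
  \frac{\mathrm{Regret}_T}{T} \;\longrightarrow\; 0 \quad (T \to \infty),
\]
```

with the convergence holding against every realization of the reward sequence.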



Cited by 74 publications (90 citation statements)
References 19 publications
“…Dekel and Hazan [9] proves a matching upper bound, which implies that the (undiscounted) minimax regret of the ADMDP problem is Θ(T^{2/3}). The ADMDP setting belongs to the more general class of adversarial MDPs with bandit feedback [17, 15], where the state transitions are allowed to be stochastic. This implies a Ω(T^{2/3}) lower bound on the (undiscounted) minimax regret of the general setting.…”
Section: Lower Bound For Online Adversarial Markov Decision Processes (mentioning)
confidence: 99%
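For readability (an illustrative gloss, not part of the quoted statement), "minimax regret is Θ(T^{2/3})" can be read as the best worst-case regret attainable by any learning algorithm under bandit feedback, with Regret_T as sketched above:

```latex
% Illustrative gloss on the quoted rate; Regret_T as in the earlier sketch.
\[
  R_T^{*} \;=\; \inf_{\text{algorithm}} \;\sup_{\text{reward sequence}}
    \mathrm{Regret}_T \;=\; \Theta\!\bigl(T^{2/3}\bigr).
\]
```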
“…They studied two possible models therein: one is that the transition matrix is stationary, chosen once by Nature at the beginning, and second, where Nature chooses at each timestep a matrix from the set. For this setting, they show how to compute the optimal policy using linear programming. Following our initial publication, Yu et al. [20] studied a similar model where the transition matrix is known and stationary and the rewards are chosen by an adversary. Their algorithm is based on following the perturbed leader and is computationally more efficient.…”
Section: Related Work (mentioning)
confidence: 99%
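For readers unfamiliar with "following the perturbed leader", here is a minimal generic sketch in Python over a finite set of stationary policies with full-information reward feedback. It is not the construction of Yu et al. [20] (their MDP algorithm must also handle the transient effects of switching policies); the policy identifiers, the eta perturbation scale, and the reward_fn callback are all assumed for illustration.

```python
import random

def follow_the_perturbed_leader(policies, reward_fn, horizon, eta=1.0):
    """Generic follow-the-perturbed-leader sketch over a finite policy set.

    policies : list of hashable policy identifiers (assumed)
    reward_fn: callback reward_fn(t, policy) -> float, the reward that policy
               would have earned at round t (full-information feedback, assumed)
    horizon  : number of rounds T
    eta      : scale of the exponential perturbation (assumed tuning knob)
    """
    cumulative = {p: 0.0 for p in policies}  # past cumulative reward per policy
    chosen = []
    for t in range(horizon):
        # Perturb each policy's cumulative reward with fresh exponential noise,
        # then follow the (perturbed) leader.
        perturbed = {p: cumulative[p] + random.expovariate(1.0 / eta)
                     for p in policies}
        leader = max(perturbed, key=perturbed.get)
        chosen.append(leader)
        # Observe this round's rewards and update the cumulative totals.
        for p in policies:
            cumulative[p] += reward_fn(t, p)
    return chosen
```

The core idea is unchanged from the expert setting: perturb the cumulative payoffs with fresh noise each round and play the resulting leader, which keeps the selection stable enough to control regret.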
“…Firstly, the known regret-minimizing techniques do not apply directly because some sequences of transition functions may prevent ergodicity or create periodicity, which causes non-vanishing average regret [9]. Secondly, whereas it is possible to obtain asymptotically vanishing average regret when only the reward functions change arbitrarily, if both the transition probabilities and rewards change arbitrarily, as in our model, it is NP-hard to compute a policy that comes close to the best stationary policy [3].…”
Section: Hardness (mentioning)
confidence: 96%
“…Suppose that the agent takes the transition model of Figure 1(a) as the nominal model and adopts a regret-minimizing policy for MDPs (based on combining expert advice or perturbing optimal policies [3], [9]) based on this nominal model. After a long time, this policy assigns a high probability to taking the 'left' action.…”
Section: Examples (mentioning)
confidence: 99%