The main lesson here is that offline planning can, in the worst case, scale exponentially with the dimensionality of the state space (Chow and Tsitsiklis, 1989), while online planning (i.e., planning for the "current state") can break the curse of dimensionality by amortizing the planning effort over multiple time steps (Rust, 1996; Szepesvári, 2001). Other topics of interest include linear programming-based approaches (de Farias and Van Roy, 2003, 2006), dual dynamic programming (Wang et al., 2008), techniques based on sample average approximation (Shapiro, 2003) such as PEGASUS (Ng and Jordan, 2000), online learning in MDPs with arbitrary reward processes (Even-Dar et al., 2005; Neu et al., 2010), and learning with (almost) no restrictions in a competitive framework (Hutter, 2004). Other important topics include learning and acting in partially observed MDPs (for recent developments, see, e.g., Littman et al., 2001; Toussaint et al., 2008), learning and acting in games or under other optimization criteria (Littman, 1994; Heger, 1994; Szepesvári and Littman, 1999; Borkar and Meyn, 2002), and the development of hierarchical and multi-time-scale methods (Dietterich, 1998; Sutton et al., 1999b).
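To make the offline/online contrast concrete, the following is a minimal sketch of online planning for the current state via depth-limited sampled lookahead with a generative model, in the spirit of sparse sampling. The `simulate` interface, all names, and the toy MDP are illustrative assumptions, not taken from the works cited above; the point is only that the per-decision cost depends on the lookahead depth and sampling width, not on the size of the state space.

```python
import random


def online_plan(state, actions, simulate, depth, width, gamma=0.95):
    """Pick an action for `state` by depth-limited sparse lookahead.

    `simulate(state, action)` is an assumed generative model returning
    (next_state, reward). The per-call cost is O((len(actions) * width) ** depth),
    independent of how many states the MDP has; this is the amortization
    argument for online planning.
    """
    def value(s, d):
        # Estimated optimal value of state s with d lookahead steps left.
        if d == 0:
            return 0.0
        best = float("-inf")
        for a in actions:
            total = 0.0
            for _ in range(width):  # sample `width` successors per action
                s2, r = simulate(s, a)
                total += r + gamma * value(s2, d - 1)
            best = max(best, total / width)
        return best

    # Choose the action with the highest sampled lookahead value.
    def q(a):
        total = 0.0
        for _ in range(width):
            s2, r = simulate(state, a)
            total += r + gamma * value(s2, depth - 1)
        return total / width

    return max(actions, key=q)


# Toy 1-D random-walk MDP, used only to exercise the planner: the agent
# is rewarded for staying near the origin.
def simulate(s, a):
    s2 = s + a + random.choice((-1, 0, 1))
    return s2, -abs(s2)


if __name__ == "__main__":
    print(online_plan(0, actions=(-1, 0, 1), simulate=simulate, depth=3, width=4))
```

An offline planner would instead have to tabulate or approximate a value function over the whole state space before acting; the sketch above spends its (exponential-in-depth, but state-space-independent) budget only on the state actually encountered, one decision at a time.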