2004
DOI: 10.1023/b:mach.0000019802.64038.6c

A Reinforcement Learning Algorithm Based on Policy Iteration for Average Reward: Empirical Results with Yield Management and Convergence Analysis

Abstract: We present a Reinforcement Learning (RL) algorithm based on policy iteration for solving average reward Markov and semi-Markov decision problems. In the literature on discounted reward RL, algorithms based on policy iteration and actor-critic algorithms have appeared. Our algorithm is an asynchronous, model-free algorithm (which can be used on large-scale problems) that hinges on the idea of computing the value function of a given policy and searching over policy space. In the applied operations rese…
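In the average-reward setting, "computing the value function of a given policy" amounts to solving the standard policy-evaluation equation; the display below is the textbook formulation in our own notation, not an equation quoted from the paper:

\[
h^{\pi}(s) \;=\; r\bigl(s,\pi(s)\bigr) \;-\; \rho^{\pi} \;+\; \sum_{s'} p\bigl(s' \mid s,\pi(s)\bigr)\, h^{\pi}(s'),
\]

where \rho^{\pi} is the long-run average reward of policy \pi and h^{\pi} is its relative (bias) value function. Searching over policy space then means improving the policy by choosing, in each state, an action that maximizes r(s,a) - \rho^{\pi} + \sum_{s'} p(s' \mid s,a)\, h^{\pi}(s').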

Cited by 79 publications (41 citation statements); references 32 publications.

Citation statements (ordered by relevance):
“…The optimization criterion of infinite-horizon average-cost has been studied in the past two decades [20,10,13,21]; it was used for policy search [3], and successfully applied to gait optimization [23]. Local methods of optimization that use a simultaneous representation include multiple shooting [5] and space-time constraints [26], and this approach has been applied to gait design [25,15].…”
Section: Related Work (mentioning; confidence: 99%)
“…Q-P-Learning: Q-P-Learning (Gosavi, 2004b, 2003) follows the scheme of modified PI in which a policy is chosen, its value function is estimated (PE), and using the value function, a better policy is selected (policy improvement). This continues until the algorithm cannot produce a better policy.…”
Section: Reinforcement Learning with Q-values (mentioning; confidence: 99%)
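A minimal sketch of the modified policy-iteration scheme described in the excerpt above: estimate the Q-values of a fixed policy by simulation (policy evaluation), then switch to the greedy policy (policy improvement), and stop when no better policy is produced. This is our simplified reconstruction, not the paper's algorithm; the toy simulator and all names (step, evaluate_policy, q_p_learning) are hypothetical.

import random

N_STATES, N_ACTIONS = 2, 2

def step(state, action):
    # Hypothetical toy simulator: returns (next_state, reward).
    if random.random() < 0.7:
        return (state + action) % N_STATES, 1.0 if action == state else 0.0
    return random.randrange(N_STATES), 0.5

def evaluate_policy(policy, iters=5000, alpha=0.1, beta=0.01):
    # Model-free evaluation of a fixed policy: a relative (average-reward)
    # TD-style estimate of its Q-values, with rho tracking the policy's
    # average reward. A simplification of the evaluation phase.
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    rho = 0.0
    for _ in range(iters):
        s = random.randrange(N_STATES)        # asynchronous, exploring starts
        a = random.randrange(N_ACTIONS)       # visit every action occasionally
        s2, r = step(s, a)
        target = r - rho + q[s2][policy[s2]]  # continue with the fixed policy
        q[s][a] += alpha * (target - q[s][a])
        if a == policy[s]:                    # update the gain on on-policy moves only
            rho += beta * (r - rho)
    return q

def q_p_learning(max_iters=20):
    # Outer loop: evaluate, improve, stop when the greedy policy is unchanged.
    policy = [0] * N_STATES
    for _ in range(max_iters):
        q = evaluate_policy(policy)
        improved = [max(range(N_ACTIONS), key=lambda a: q[s][a])
                    for s in range(N_STATES)]
        if improved == policy:
            break
        policy = improved
    return policy

print(q_p_learning())

Because the evaluation is simulation-based and therefore noisy, the stopping rule is only heuristic here; in practice the number of outer iterations is capped, as max_iters does above.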
“…The model they learn is able to capture the complex dynamics of the AGV problem. A well-known "revenue management problem" can be set up as an average-reward SMDP (Gosavi, 2004b). But it has a unique reward structure with much of the reward concentrated in certain states that makes SMART, which is TD(0), unstable.…”
Section: Semi-Markov Decision Problems (mentioning; confidence: 99%)
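For context, the SMART update mentioned above is a TD(0)-style rule for average-reward SMDPs; the sketch below is our paraphrase of that kind of update (learning-rate schedules, exploration, and the exact gain update are omitted, and all names are hypothetical).

def smart_update(q, s, a, r, tau, s_next, rho, alpha):
    # One SMART-style step on a table q mapping state -> {action: value}:
    # the sojourn time tau weights the average-reward term rho, and the
    # target bootstraps from the best action in the next state.
    target = r - rho * tau + max(q[s_next].values())
    q[s][a] += alpha * (target - q[s][a])

def update_gain(total_reward, total_time, r, tau):
    # Running estimate of the reward rate rho = cumulative reward / cumulative
    # time, typically refreshed only on greedy (non-exploratory) steps.
    total_reward += r
    total_time += tau
    return total_reward, total_time, total_reward / total_time

When most of the reward arrives in a few states, the bootstrapped one-step target above can fluctuate sharply, which is one plausible reading of the instability the excerpt describes.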
“…The answer to this question might pave the way to solving the MDP without strict adherence to Bellman principles. Gosavi (2004a) has shown that for any given state, if the absolute value of the error in the value function is less than half of the absolute value of the difference between the Q-value of the optimal action and the Q-value of the sub-optimal action (assuming we have 2 actions in each state), then that error can be tolerated. But an in-depth study of this issue may prove to be of importance in the future, especially in the context of function approximation, where we have clear deviation from Bellman optimality.…”
Section: Is Bellman Optimality Worth Achieving? (mentioning; confidence: 99%)
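In symbols, the condition quoted above (our notation; \hat{Q} is the approximate value function, a^{*} the optimal and a' the sub-optimal action in state s) reads:

\[
\max_{b \in \{a^{*},\,a'\}} \bigl|\hat{Q}(s,b) - Q^{*}(s,b)\bigr| \;<\; \tfrac{1}{2}\,\bigl(Q^{*}(s,a^{*}) - Q^{*}(s,a')\bigr),
\]

which implies \hat{Q}(s,a^{*}) > \hat{Q}(s,a'), so the greedy action under the approximate values still coincides with a^{*} and the error is indeed tolerated.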