2004
DOI: 10.1023/b:mach.0000019802.64038.6c

A Reinforcement Learning Algorithm Based on Policy Iteration for Average Reward: Empirical Results with Yield Management and Convergence Analysis

Abstract: We present a Reinforcement Learning (RL) algorithm based on policy iteration for solving average reward Markov and semi-Markov decision problems. In the literature on discounted reward RL, algorithms based on policy iteration and actor-critic algorithms have appeared. Our algorithm is an asynchronous, model-free algorithm (which can be used on large-scale problems) that hinges on the idea of computing the value function of a given policy and searching over policy space. In the applied operations rese…
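In the average-reward setting, "computing the value function of a given policy" amounts to solving the standard policy-evaluation equation; the display below is the textbook formulation in our own notation, not an equation quoted from the paper:

\[
h^{\pi}(s) \;=\; r\bigl(s,\pi(s)\bigr) \;-\; \rho^{\pi} \;+\; \sum_{s'} p\bigl(s' \mid s,\pi(s)\bigr)\, h^{\pi}(s'),
\]

where \rho^{\pi} is the long-run average reward of policy \pi and h^{\pi} is its relative (bias) value function. Searching over policy space then means improving the policy by choosing, in each state, an action that maximizes r(s,a) - \rho^{\pi} + \sum_{s'} p(s' \mid s,a)\, h^{\pi}(s').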

Cited by 79 publications (41 citation statements); references 32 publications.

Citation statements (ordered by relevance):
“…The optimization criterion of infinite-horizon average-cost has been studied in the past two decades [20,10,13,21]; it was used for policy search [3], and successfully applied to gait optimization [23]. Local methods of optimization that use a simultaneous representation include multiple shooting [5] and space-time constraints [26], and this approach has been applied to gait design [25,15].…”
Section: Related Work (mentioning; confidence: 99%)
“…Q-P-Learning: Q-P-Learning (Gosavi, 2004b, 2003) follows the scheme of modified PI in which a policy is chosen, its value function is estimated (PE), and using the value function, a better policy is selected (policy improvement). This continues until the algorithm cannot produce a better policy.…”
Section: Reinforcement Learning with Q-values (mentioning; confidence: 99%)
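A minimal sketch of the modified policy-iteration scheme described in the excerpt above: estimate the Q-values of a fixed policy by simulation (policy evaluation), then switch to the greedy policy (policy improvement), and stop when no better policy is produced. This is our simplified reconstruction, not the paper's algorithm; the toy simulator and all names (step, evaluate_policy, q_p_learning) are hypothetical.

import random

N_STATES, N_ACTIONS = 2, 2

def step(state, action):
    # Hypothetical toy simulator: returns (next_state, reward).
    if random.random() < 0.7:
        return (state + action) % N_STATES, 1.0 if action == state else 0.0
    return random.randrange(N_STATES), 0.5

def evaluate_policy(policy, iters=5000, alpha=0.1, beta=0.01):
    # Model-free evaluation of a fixed policy: a relative (average-reward)
    # TD-style estimate of its Q-values, with rho tracking the policy's
    # average reward. A simplification of the evaluation phase.
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    rho = 0.0
    for _ in range(iters):
        s = random.randrange(N_STATES)        # asynchronous, exploring starts
        a = random.randrange(N_ACTIONS)       # visit every action occasionally
        s2, r = step(s, a)
        target = r - rho + q[s2][policy[s2]]  # continue with the fixed policy
        q[s][a] += alpha * (target - q[s][a])
        if a == policy[s]:                    # update the gain on on-policy moves only
            rho += beta * (r - rho)
    return q

def q_p_learning(max_iters=20):
    # Outer loop: evaluate, improve, stop when the greedy policy is unchanged.
    policy = [0] * N_STATES
    for _ in range(max_iters):
        q = evaluate_policy(policy)
        improved = [max(range(N_ACTIONS), key=lambda a: q[s][a])
                    for s in range(N_STATES)]
        if improved == policy:
            break
        policy = improved
    return policy

print(q_p_learning())

Because the evaluation is simulation-based and therefore noisy, the stopping rule is only heuristic here; in practice the number of outer iterations is capped, as max_iters does above.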
“…The model they learn is able to capture the complex dynamics of the AGV problem. A well-known "revenue management problem" can be set up as an average-reward SMDP (Gosavi, 2004b). But it has a unique reward structure with much of the reward concentrated in certain states that makes SMART, which is TD(0), unstable.…”
Section: Semi-Markov Decision Problems (mentioning; confidence: 99%)
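For context, the SMART update mentioned above is a TD(0)-style rule for average-reward SMDPs; the sketch below is our paraphrase of that kind of update (learning-rate schedules, exploration, and the exact gain update are omitted, and all names are hypothetical).

def smart_update(q, s, a, r, tau, s_next, rho, alpha):
    # One SMART-style step on a table q mapping state -> {action: value}:
    # the sojourn time tau weights the average-reward term rho, and the
    # target bootstraps from the best action in the next state.
    target = r - rho * tau + max(q[s_next].values())
    q[s][a] += alpha * (target - q[s][a])

def update_gain(total_reward, total_time, r, tau):
    # Running estimate of the reward rate rho = cumulative reward / cumulative
    # time, typically refreshed only on greedy (non-exploratory) steps.
    total_reward += r
    total_time += tau
    return total_reward, total_time, total_reward / total_time

When most of the reward arrives in a few states, the bootstrapped one-step target above can fluctuate sharply, which is one plausible reading of the instability the excerpt describes.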
“…The answer to this question might pave the way to solving the MDP without strict adherence to Bellman principles. Gosavi (2004a) has shown that for any given state, if the absolute value of the error in the value function is less than half of the absolute value of the difference between the Q-value of the optimal action and the Q-value of the sub-optimal action (assuming we have 2 actions in each state), then that error can be tolerated. But an in-depth study of this issue may prove to be of importance in the future, especially in the context of function approximation, where we have clear deviation from Bellman optimality.…”
Section: Is Bellman Optimality Worth Achieving? (mentioning; confidence: 99%)
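In symbols, the condition quoted above (our notation; \hat{Q} is the approximate value function, a^{*} the optimal and a' the sub-optimal action in state s) reads:

\[
\max_{b \in \{a^{*},\,a'\}} \bigl|\hat{Q}(s,b) - Q^{*}(s,b)\bigr| \;<\; \tfrac{1}{2}\,\bigl(Q^{*}(s,a^{*}) - Q^{*}(s,a')\bigr),
\]

which implies \hat{Q}(s,a^{*}) > \hat{Q}(s,a'), so the greedy action under the approximate values still coincides with a^{*} and the error is indeed tolerated.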