1998
DOI: 10.1137/s0363012995291609

A New Value Iteration Method for the Average Cost Dynamic Programming Problem

Abstract: We propose a new value iteration method for the classical average cost Markovian decision problem, under the assumption that all stationary policies are unichain and, furthermore, that there exists a state that is recurrent under all stationary policies. This method is motivated by a relation between the average cost problem and an associated stochastic shortest path problem.
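For context, the relation referred to in the abstract can be sketched as follows (standard notation, not reproduced from the paper: g(i,u) are stage costs, p_{ij}(u) transition probabilities, t the recurrent reference state). Under the unichain assumption, the optimal average cost lambda* and a differential cost vector h satisfy

\[
\lambda^* + h(i) \;=\; \min_{u \in U(i)} \Big[\, g(i,u) + \sum_{j=1}^{n} p_{ij}(u)\, h(j) \,\Big], \qquad i = 1,\dots,n .
\]

The associated stochastic shortest path problem is obtained by making the reference state t the termination state and charging stage costs g(i,u) - \lambda; the optimal average cost \lambda^* is then the value of \lambda for which the optimal cost of this shortest path problem, starting from t, equals zero.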

Cited by 38 publications (35 citation statements). References 8 publications.
“…SSP Q-learning is based on the observation that the average cost under any stationary policy is simply the ratio of expected total cost and expected time between two successive visits to the reference state s. This connection was exploited by Bertsekas in [5] to give a new algorithm for computing V (·), which we describe below.…”
Section: SSP Q-learning
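Stated explicitly, the ratio mentioned in this excerpt is the renewal-reward identity (our notation): for a stationary policy mu under which the reference state s is recurrent,

\[
\lambda_\mu \;=\; \frac{\mathbb{E}_\mu\big[\text{cost accumulated between two successive visits to } s\big]}{\mathbb{E}_\mu\big[\text{time between two successive visits to } s\big]} .
\]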
“…This is the algorithm of [5], wherein the first "fast" iteration sees λ k as quasi-static (b(k)'s are "small") and thus tracks V λ k (·), while the second "slow" iteration gradually guides λ k to the desired value.…”
Section: SSP Q-learning
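The fast/slow structure described in this excerpt can be illustrated schematically. The sketch below is a synchronous, model-based caricature on synthetic data, not the cited paper's exact scheme; the stepsize b(k) = 1/k and the precise update forms are our own illustrative choices.

import numpy as np

# Schematic two-timescale iteration on a small synthetic unichain MDP.
# Fast step: one sweep of SSP-style value iteration treating lambda as quasi-static,
# with the reference state acting as the (zero-cost) termination state.
# Slow step: lambda is nudged toward the value at which V(ref) = 0.
n_states, n_actions, ref = 3, 2, 0
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[i, u, j]
g = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # stage costs
V, lam = np.zeros(n_states), 0.0
for k in range(1, 5001):
    b = 1.0 / k                                            # "slow" stepsize b(k)
    term = np.where(np.arange(n_states) == ref, 0.0, V)    # value fixed to 0 at ref
    Q = g - lam + np.einsum('iuj,j->iu', P, term)          # fast Bellman sweep
    V = Q.min(axis=1)
    lam += b * V[ref]                                      # slow drift of lambda
print("estimated optimal average cost:", lam)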
“…Other algorithms such as policy iteration or a hybrid algorithm are also frequently used [4], [20], [28]. However, it is arguably this result on running times that makes value iteration the basis of most practical algorithms for large-scale MDPs with many states.…”
Section: Basics of MDPs
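For readers who want the baseline being referred to, here is a minimal sketch of plain discounted value iteration; the synthetic input layout and the discount factor of 0.95 are our illustrative assumptions, not taken from the cited paper.

import numpy as np

def value_iteration(P, g, gamma=0.95, tol=1e-8, max_iter=10_000):
    # P[i, u, j]: transition probabilities; g[i, u]: stage costs.
    # Repeated Bellman backups until the sup-norm change falls below tol.
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        Q = g + gamma * np.einsum('iuj,j->iu', P, V)
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmin(axis=1)   # value function and a greedy policy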
“…The key in this approach is that the separation oracle, model (4), is a standard unconstrained MDP that we can solve in time VI using value iteration. In model (4), we write the optimization over X instead of ext(X).…”
Section: An Ellipsoid Algorithm Approach
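Model (4) is not reproduced in this excerpt, but the final remark rests on a standard fact: when the objective is linear and X is a bounded polytope (as in the usual occupancy-measure formulations, which we assume here), optimizing over X and over its extreme points gives the same value,

\[
\min_{x \in X} c^\top x \;=\; \min_{x \in \operatorname{ext}(X)} c^\top x ,
\]

so posing the separation-oracle MDP over X rather than ext(X) does not change its optimal value.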
“…All these algorithms use some form of value iteration. In SMART and Relaxed-SMART, a form of the average reward Bellman equation is directly used while the algorithm in Abounadi, Bertsekas, and Borkar uses a distinct updating scheme based on a form of value iteration given in Bertsekas (1995b) that has been proven to converge. Relaxed-SMART has been proven to converge in Gosavi (2003).…”
Section: Introduction
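For reference, the average reward Bellman optimality equation that SMART-type methods work with can be written in its standard form (our notation) as

\[
h(s) \;=\; \max_{a}\Big[\, r(s,a) \;-\; \rho^* \;+\; \sum_{s'} p(s' \mid s, a)\, h(s') \,\Big],
\]

where rho* is the optimal average reward and h is a differential value function; average cost formulations replace the max with a min and rewards with costs.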