1998
DOI: 10.1137/s0363012995291609

A New Value Iteration Method for the Average Cost Dynamic Programming Problem

Abstract: We propose a new value iteration method for the classical average cost Markovian decision problem, under the assumption that all stationary policies are unichain and, furthermore, that there exists a state that is recurrent under all stationary policies. This method is motivated by a relation between the average cost problem and an associated stochastic shortest path problem.
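For context, the relation referred to in the abstract can be sketched as follows (standard notation, not reproduced from the paper: g(i,u) are stage costs, p_{ij}(u) transition probabilities, t the recurrent reference state). Under the unichain assumption, the optimal average cost lambda* and a differential cost vector h satisfy

\[
\lambda^* + h(i) \;=\; \min_{u \in U(i)} \Big[\, g(i,u) + \sum_{j=1}^{n} p_{ij}(u)\, h(j) \,\Big], \qquad i = 1,\dots,n .
\]

The associated stochastic shortest path problem is obtained by making the reference state t the termination state and charging stage costs g(i,u) - \lambda; the optimal average cost \lambda^* is then the value of \lambda for which the optimal cost of this shortest path problem, starting from t, equals zero.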

Cited by 38 publications (35 citation statements). References 8 publications.
“…SSP Q-learning is based on the observation that the average cost under any stationary policy is simply the ratio of expected total cost and expected time between two successive visits to the reference state s. This connection was exploited by Bertsekas in [5] to give a new algorithm for computing V (·), which we describe below.…”
Section: SSP Q-learning
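Stated explicitly, the ratio mentioned in this excerpt is the renewal-reward identity (our notation): for a stationary policy mu under which the reference state s is recurrent,

\[
\lambda_\mu \;=\; \frac{\mathbb{E}_\mu\big[\text{cost accumulated between two successive visits to } s\big]}{\mathbb{E}_\mu\big[\text{time between two successive visits to } s\big]} .
\]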
“…This is the algorithm of [5], wherein the first "fast" iteration sees λ k as quasi-static (b(k)'s are "small") and thus tracks V λ k (·), while the second "slow" iteration gradually guides λ k to the desired value.…”
Section: SSP Q-learning
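The fast/slow structure described in this excerpt can be illustrated schematically. The sketch below is a synchronous, model-based caricature on synthetic data, not the cited paper's exact scheme; the stepsize b(k) = 1/k and the precise update forms are our own illustrative choices.

import numpy as np

# Schematic two-timescale iteration on a small synthetic unichain MDP.
# Fast step: one sweep of SSP-style value iteration treating lambda as quasi-static,
# with the reference state acting as the (zero-cost) termination state.
# Slow step: lambda is nudged toward the value at which V(ref) = 0.
n_states, n_actions, ref = 3, 2, 0
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[i, u, j]
g = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # stage costs
V, lam = np.zeros(n_states), 0.0
for k in range(1, 5001):
    b = 1.0 / k                                            # "slow" stepsize b(k)
    term = np.where(np.arange(n_states) == ref, 0.0, V)    # value fixed to 0 at ref
    Q = g - lam + np.einsum('iuj,j->iu', P, term)          # fast Bellman sweep
    V = Q.min(axis=1)
    lam += b * V[ref]                                      # slow drift of lambda
print("estimated optimal average cost:", lam)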
“…Other algorithms such as policy iteration or a hybrid algorithm are also frequently used [4], [20], [28]. However, it is arguably this result on running times that makes value iteration the basis of most practical algorithms for large-scale MDPs with many states.…”
Section: Basics of MDPs
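For readers who want the baseline being referred to, here is a minimal sketch of plain discounted value iteration; the synthetic input layout and the discount factor of 0.95 are our illustrative assumptions, not taken from the cited paper.

import numpy as np

def value_iteration(P, g, gamma=0.95, tol=1e-8, max_iter=10_000):
    # P[i, u, j]: transition probabilities; g[i, u]: stage costs.
    # Repeated Bellman backups until the sup-norm change falls below tol.
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        Q = g + gamma * np.einsum('iuj,j->iu', P, V)
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmin(axis=1)   # value function and a greedy policy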
“…The key in this approach is that the separation oracle, model (4), is a standard unconstrained MDP that we can solve in time VI using value iteration. In model (4), we write the optimization over X instead of ext(X).…”
Section: An Ellipsoid Algorithm Approach
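Model (4) is not reproduced in this excerpt, but the final remark rests on a standard fact: when the objective is linear and X is a bounded polytope (as in the usual occupancy-measure formulations, which we assume here), optimizing over X and over its extreme points gives the same value,

\[
\min_{x \in X} c^\top x \;=\; \min_{x \in \operatorname{ext}(X)} c^\top x ,
\]

so posing the separation-oracle MDP over X rather than ext(X) does not change its optimal value.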
“…All these algorithms use some form of value iteration. In SMART and Relaxed-SMART, a form of the average reward Bellman equation is directly used while the algorithm in Abounadi, Bertsekas, and Borkar uses a distinct updating scheme based on a form of value iteration given in Bertsekas (1995b) that has been proven to converge. Relaxed-SMART has been proven to converge in Gosavi (2003).…”
Section: Introduction
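For reference, the average reward Bellman optimality equation that SMART-type methods work with can be written in its standard form (our notation) as

\[
h(s) \;=\; \max_{a}\Big[\, r(s,a) \;-\; \rho^* \;+\; \sum_{s'} p(s' \mid s, a)\, h(s') \,\Big],
\]

where rho* is the optimal average reward and h is a differential value function; average cost formulations replace the max with a min and rewards with costs.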