Strategy Iteration Is Strongly Polynomial for 2-Player Turn-Based Stochastic Games with a Constant Discount Factor

Hansen, Thomas Dueholm; Miltersen, Peter Bro; Zwick, Uri

doi:10.1145/2432622.2432623

Cited by 75 publications

(81 citation statements)

References 35 publications


“…[11] and [24] improved the iteration bound to O( Recent developments [18,19] showed that linear programs can be solved in Õ(√rank(A)) number of linear system solves, which, applied to DMDP, leads to a running time of Õ(|S|^2.5 |A| L) and O(|S|^2.5 |A| log(M/((1 − γ)ε))) (see Appendix B for a derivation).…”

Section: Previous Work

mentioning

confidence: 99%

Variance Reduced Value Iteration and Faster Algorithms for Solving Markov Decision Processes

Sidford

Wang

et al. 2018

Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms


In this paper we provide faster algorithms for approximately solving discounted Markov Decision Processes in multiple parameter regimes. Given a discounted Markov Decision Process (DMDP) with |S| states, |A| actions, discount factor γ ∈ (0, 1), and rewards in the range [−M, M], we show how to compute an ε-optimal policy, with probability 1 − δ in time […]. This contribution reflects the first nearly linear time, nearly linearly convergent algorithm for solving DMDP's for intermediate values of γ. We also show how to obtain improved sublinear time algorithms and provide an algorithm which computes an ε-optimal policy with probability 1 − δ in time (1 − γ) 4 2 log 1 δ provided we can sample from the transition function in O(1) time. Interestingly, we obtain our results by a careful modification of approximate value iteration. We show how to combine classic approximate value iteration analysis with new techniques in variance reduction. Our fastest algorithms leverage further insights to ensure that our algorithms make monotonic progress towards the optimal value. This paper is one of few instances in using sampling to obtain a linearly convergent linear programming algorithm and we hope that the analysis may be useful more broadly.
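For orientation, the classic (exact, non-sampled) value iteration that the abstract builds on can be sketched as follows. This is a minimal illustration only: it is not the variance-reduced algorithm of the paper, and the MDP encoding, the function names `value_iteration` and `greedy_policy`, and the two-state `example` are all assumptions made up for this sketch.

```python
from typing import List, Tuple

# Hypothetical encoding (not from the paper): mdp[s] is the list of actions
# available in state s; each action is (reward, [(next_state, probability), ...]).
Action = Tuple[float, List[Tuple[int, float]]]
MDP = List[List[Action]]


def backup(action: Action, v: List[float], gamma: float) -> float:
    """One-step lookahead value of an action under the value estimate v."""
    reward, transitions = action
    return reward + gamma * sum(p * v[t] for t, p in transitions)


def value_iteration(mdp: MDP, gamma: float, tol: float = 1e-9) -> List[float]:
    """Apply the Bellman optimality operator until successive iterates
    differ by at most tol in the max norm."""
    v = [0.0] * len(mdp)
    while True:
        new_v = [max(backup(a, v, gamma) for a in actions) for actions in mdp]
        gap = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if gap <= tol:
            return v


def greedy_policy(mdp: MDP, gamma: float, v: List[float]) -> List[int]:
    """Pick, in every state, an action maximizing the one-step lookahead."""
    return [
        max(range(len(actions)), key=lambda a: backup(actions[a], v, gamma))
        for actions in mdp
    ]


# Made-up two-state example: state 0 can stay (reward 0) or move to state 1
# (reward 1); state 1 loops on itself with reward 2.
example: MDP = [
    [(0.0, [(0, 1.0)]), (1.0, [(1, 1.0)])],
    [(2.0, [(1, 1.0)])],
]

if __name__ == "__main__":
    values = value_iteration(example, gamma=0.5)
    print(values, greedy_policy(example, 0.5, values))
```

Because the Bellman operator is a γ-contraction in the max norm, stopping when successive iterates differ by at most `tol` leaves the returned values within γ·tol/(1 − γ) of the optimal values.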


“…The paper [7] solves a considerable generalization of the Markov decision problem with complexity similar to that of [15]. Here one assumes that A and c are as in a Markov decision problem.…”

Section: Theorem 3 If the Optimal Value Of The Linear Program L P(a) mentioning

confidence: 99%

“…Section 3 discusses the Markov decision problem. The complexity results of [15] and [7] have the term 1 − γ in the denominator. This leads us to look for equivalent Markov decision problems for which this term is as large as possible.…”

mentioning

confidence: 99%

Efficient computation of a canonical form for a matrix with the generalized P-property

Morris

2014

Math. Program.


“…For total- and average-reward MDPs, the largest known lower bound is also exponential and has recently been found by Fearnley through a carefully built family of examples [6], based on a construction for parity games that was proposed by Friedmann [9]. This was a breakthrough after more than 25 years of research on the question of the complexity of PI [11], [15]. The story seems different though for discounted-reward MDPs for which a strongly polynomial upper bound has recently been found by Ye [20], yet only for a fixed discount factor; PI is shown to run in at most n²(k−1)/(1−λ) · log(n²/(1−λ)) iterations in that case.…”

Section: Introduction

mentioning

confidence: 99%

“…The story seems different though for discounted-reward MDPs for which a strongly polynomial upper bound has recently been found by Ye [20], yet only for a fixed discount factor; PI is shown to run in at most n²(k−1)/(1−λ) · log(n²/(1−λ)) iterations in that case. This bound was later improved by a factor n by Hansen et al. [11] and adapted to two-player turn-based zero-sum games, a natural two-player extension of MDPs for which PI also applies. Thereby, they provided the first (strongly) polynomial time algorithm for this latter class of problems.…”

Section: Introduction

mentioning

confidence: 99%

The complexity of Policy Iteration is exponential for discounted Markov Decision Processes

Hollanders

Delvenne

Jungers

2012

2012 IEEE 51st IEEE Conference on Decision and Control (CDC)


The question of knowing whether the Policy Iteration algorithm (PI) for solving stationary Markov Decision Processes (MDPs) has exponential or (strongly) polynomial complexity has attracted much attention in the last 25 years. Recently, a family of examples on which PI requires an exponential number of iterations to converge was proposed for the total-reward and the average-reward criteria. On the other hand, it was shown that PI runs in strongly polynomial time on discounted-reward MDPs, yet only when the discount factor is fixed beforehand. In this work, we show that PI needs an exponential number of steps to converge on discounted-reward MDPs with a general discount factor.
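As a point of reference for the iteration counts discussed in these statements, a minimal sketch of policy iteration on a discounted MDP is given below. It uses the common variant that switches every state with an improving action (often called Howard's rule) and covers only the single-player case, not the two-player strategy iteration of Hansen et al. The MDP encoding, the function names and the two-state `example` are assumptions invented for this illustration.

```python
from typing import List, Tuple

# Hypothetical encoding (not from any cited paper): mdp[s] is the list of actions
# in state s; each action is (reward, [(next_state, probability), ...]).
Action = Tuple[float, List[Tuple[int, float]]]
MDP = List[List[Action]]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def evaluate(mdp: MDP, policy: List[int], gamma: float) -> List[float]:
    """Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly."""
    n = len(mdp)
    a = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for s in range(n):
        reward, transitions = mdp[s][policy[s]]
        a[s][s] += 1.0
        for t, p in transitions:
            a[s][t] -= gamma * p
        b[s] = reward
    return solve_linear(a, b)


def policy_iteration(mdp: MDP, gamma: float, tol: float = 1e-12):
    """Return (policy, values, number of evaluation rounds)."""
    policy = [0] * len(mdp)
    rounds = 0
    while True:
        v = evaluate(mdp, policy, gamma)
        rounds += 1
        changed = False
        for s, actions in enumerate(mdp):
            # One-step lookahead value of every action in state s.
            q = [r + gamma * sum(p * v[t] for t, p in trans) for r, trans in actions]
            best = max(range(len(actions)), key=lambda i: q[i])
            if q[best] > q[policy[s]] + tol:  # switch only on strict improvement
                policy[s] = best
                changed = True
        if not changed:
            return policy, v, rounds


# Made-up two-state example: state 0 can stay (reward 0) or move to state 1
# (reward 1); state 1 loops on itself with reward 2.
example: MDP = [
    [(0.0, [(0, 1.0)]), (1.0, [(1, 1.0)])],
    [(2.0, [(1, 1.0)])],
]

if __name__ == "__main__":
    print(policy_iteration(example, gamma=0.5))
```

The number of evaluation rounds returned by `policy_iteration` is the quantity that the upper and lower bounds quoted above are about; on this toy instance it is tiny and says nothing about worst-case behaviour.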
