2012 51st IEEE Conference on Decision and Control (CDC)
DOI: 10.1109/cdc.2012.6426485

The complexity of Policy Iteration is exponential for discounted Markov Decision Processes

Abstract: The question of whether the Policy Iteration algorithm (PI) for solving stationary Markov Decision Processes (MDPs) has exponential or (strongly) polynomial complexity has attracted much attention over the last 25 years. Recently, a family of examples on which PI requires an exponential number of iterations to converge was proposed for the total-reward and the average-reward criteria. On the other hand, it was shown that PI runs in strongly polynomial time on discounted-reward MDPs, yet only when the dis…

Cited by 14 publications (10 citation statements); References 17 publications.
“…As pointed out in Scherrer et al. (2016), the lower-bound complexity of PI is considered an open problem, at least in the most general MDP formulation. Lower bounds have been derived in specific cases only, such as deterministic MDPs (Hansen & Zwick, 2010), the total-reward criterion (Fearnley, 2010), or a high discount factor (Hollanders et al., 2012). Even though we did not intend to directly address this open question, our lower-bound result seems to be a contribution on its own to the general theory of non-delayed MDPs.…”
Section: MDPs With Delay: A Degradation Example (mentioning)
confidence: 98%
“…In particular, even if the time complexity of the value iteration (VI) algorithm for convergence in terms of the number of iterations is polynomial in |X|, |A|, 1/(1 − γ), and the size of representing the inputs R and P, the dependence on 1/(1 − γ) is a major drawback (Blondel & Tsitsiklis, 2000). On the other hand, PI's time complexity for convergence is known to be exponential in general (Hollanders, Delvenne, & Jungers, 2012), even if it is strongly polynomial when γ is fixed (Ye, 2011). Note that the per-iteration computational complexity of VI is O(|A||X|²) and that of PI is O(|X|³ + |A||X|²).…”
Section: Introduction (mentioning)
confidence: 95%
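To make the per-iteration costs quoted above concrete, here is a minimal NumPy sketch; it is not taken from any of the cited works, and the function names, array shapes, and dense-matrix setting are illustrative assumptions. One VI sweep costs O(|A||X|²) from the batched matrix-vector products, while one PI iteration additionally solves a |X| × |X| linear system for exact policy evaluation, which is where the O(|X|³) term comes from:

```python
import numpy as np

def vi_sweep(P, R, gamma, V):
    """One value-iteration sweep (Bellman optimality backup).
    P: (A, X, X) transition tensor, R: (A, X) rewards, V: (X,) values.
    Cost: O(|A||X|^2) from the A matrix-vector products P[a] @ V."""
    Q = R + gamma * (P @ V)          # (A, X) action values
    return Q.max(axis=0)             # (X,) updated value function

def pi_iteration(P, R, gamma, policy):
    """One policy-iteration step: exact evaluation, then greedy improvement.
    Cost: O(|X|^3) for the linear solve plus O(|A||X|^2) for improvement."""
    X = R.shape[1]
    P_pi = P[policy, np.arange(X)]   # (X, X) transitions under the policy
    R_pi = R[policy, np.arange(X)]   # (X,) rewards under the policy
    # Solve (I - gamma * P_pi) V = R_pi exactly: the O(|X|^3) term.
    V = np.linalg.solve(np.eye(X) - gamma * P_pi, R_pi)
    Q = R + gamma * (P @ V)          # O(|A||X|^2) greedy improvement
    return Q.argmax(axis=0), V
```

Iterating pi_iteration until the policy stops changing is the scheme whose iteration count the quoted bounds concern: the exponential lower bound of Hollanders et al. applies in general, while Ye's strongly polynomial bound holds for fixed γ.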
“…Even tighter upper bounds (still exponential in n) have been shown for k = 2 [6]. Interestingly, the only lower bounds that have been shown for PI are either for the special case of k = 2 [8], [9] or when k is related to n [10], [11]. We contribute lower bounds for arbitrary n ≥ 2, k ≥ 2.…”
Section: Introduction (mentioning)
confidence: 97%