We study the computational complexity of central analysis problems for One-Counter Markov Decision Processes (OC-MDPs), a class of finitely-presented, countable-state MDPs. OC-MDPs extend finite-state MDPs with an unbounded counter. The counter can be incremented, decremented, or left unchanged during each state transition, and transitions may be enabled or disabled depending on both the current state and on whether the counter value is 0 or not. Some states are "random", from where the next transition is chosen according to a given probability distribution, while other states are "controlled", from where the next transition is chosen by the controller. Different objectives for the controller give rise to different computational problems, aimed at computing optimal achievable objective values and optimal strategies. OC-MDPs are in fact equivalent to a controlled extension of (discrete-time) Quasi-Birth-Death processes (QBDs), a purely stochastic model heavily studied in queueing theory and applied probability. They can thus be viewed as a natural "adversarial" extension of a classic stochastic model. They can also be viewed as a natural probabilistic/controlled extension of classic one-counter automata. OC-MDPs also subsume (as a very restricted special case) a recently studied MDP model called "solvency games", which models a risk-averse gambling scenario. Basic computational questions for OC-MDPs include "termination" and "limit" questions, such as the following: does the controller have a strategy to ensure that the counter (which may, for example, count the number of jobs in a queue) will hit value 0 (the empty queue) almost surely (a.s.)? Or that the counter will have lim sup value ∞, a.s.? Or that it will hit value 0 in a selected terminal state, a.s.? Or, in case such properties are not satisfied almost surely, compute their optimal probability over all strategies. We provide new upper and lower bounds on the complexity of such problems. Specifically, we show that several quantitative and almost-sure limit problems can be answered in polynomial time, and that almost-sure termination problems (without selection of desired terminal states) can also be answered in polynomial time. On the other hand, we show that the almost-sure termination problem with selected terminal states is PSPACE-hard, and we provide an exponential-time algorithm for this problem. We also characterize classes of strategies that suffice for optimality in several of these settings. Our upper bounds combine a number of techniques from the theory of MDP reward models, the theory of random walks, and a variety of automata-theoretic methods.

Such questions are similar in spirit to questions asked in the rich literature on "adversarial queueing theory" (see, e.g., [4]), although this is a somewhat different setting. These considerations lead naturally to the extension of QBDs with control, and thus to OC-MDPs. Indeed, MDP variants of QBDs have already been studied in the stochastic modeling literature, see [27,19].
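To make the model concrete, here is a minimal sketch of one way to encode an OC-MDP's syntax; the class and field names are illustrative assumptions, not notation from the paper. It captures the three ingredients the abstract describes: controlled vs. random states, counter updates of -1, 0, or +1, and transition availability conditioned on a zero test.

```python
# Hypothetical encoding of an OC-MDP (names are assumptions, not the paper's).
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Transition:
    target: str          # successor state
    counter_delta: int   # -1, 0, or +1
    prob: float = 1.0    # used only when the source state is random

@dataclass
class OCMDP:
    controlled: set[str]  # states where the controller picks the transition
    random: set[str]      # states where a fixed distribution picks it
    # Transitions enabled when the counter is positive vs. when it is zero;
    # decrements must not appear in zero_trans, since the counter stays >= 0.
    pos_trans: dict[str, list[Transition]] = field(default_factory=dict)
    zero_trans: dict[str, list[Transition]] = field(default_factory=dict)

    def enabled(self, state: str, counter: int) -> list[Transition]:
        """Return the transitions available at (state, counter)."""
        table = self.zero_trans if counter == 0 else self.pos_trans
        return table.get(state, [])
```

A run of the induced countable-state MDP then moves between configurations (state, counter), which is where termination (hitting counter value 0) and limit questions are posed.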
We study Markov decision processes (MDPs) with multiple limit-average (or mean-payoff) functions. We consider two different objectives, namely, expectation and satisfaction objectives. Given an MDP with k limit-average functions, in the expectation objective the goal is to maximize the expected limit-average value, and in the satisfaction objective the goal is to maximize the probability of runs such that the limit-average value stays above a given vector. We show that under the expectation objective, in contrast to the case of one limit-average function, both randomization and memory are necessary for strategies even for ε-approximation, and that finite-memory randomized strategies are sufficient for achieving Pareto optimal values. Under the satisfaction objective, in contrast to the case of one limit-average function, infinite memory is necessary for strategies achieving a specific value (i.e., randomized finite-memory strategies are not sufficient), whereas memoryless randomized strategies are sufficient for ε-approximation, for all ε > 0. We further prove that the decision problems for both expectation and satisfaction objectives can be solved in polynomial time, and that the trade-off curve (Pareto curve) can be ε-approximated in time polynomial in the size of the MDP and exponential in the number of limit-average functions, for all ε > 0. Our analysis also reveals flaws in previous work for MDPs with multiple mean-payoff functions under the expectation objective, corrects the flaws, and allows us to obtain improved results.
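One standard way to formalize the two objectives, written here as an assumption consistent with the abstract rather than as the paper's exact notation: fix reward functions r_1, ..., r_k and a strategy σ, and compare the limit-average vector of a run against a target vector.

```latex
% Limit-average (mean-payoff) value of a run \omega = s_0 s_1 \cdots in coordinate i:
\mathrm{lr}_i(\omega) \;=\; \liminf_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} r_i(s_t)

% Expectation objective: maximize the expected limit-average vector, i.e., achieve
\mathbb{E}^{\sigma}\bigl[\mathrm{lr}_i\bigr] \;\ge\; v_i \quad \text{for all } 1 \le i \le k

% Satisfaction objective: maximize the probability that the whole vector meets the target:
\sup_{\sigma} \; \Pr{}^{\sigma}\bigl[\{\, \omega \mid \mathrm{lr}_i(\omega) \ge v_i \text{ for all } i \,\}\bigr]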
We consider a class of infinite-state Markov decision processes generated by stateless pushdown automata. This class corresponds to 1½-player games over graphs generated by BPA systems or (equivalently) 1-exit recursive state machines. An extended reachability objective is specified by two sets S and T of safe and terminal stack configurations, where membership in S and T depends just on the top-of-the-stack symbol. The question is whether there is a suitable strategy such that the probability of hitting a terminal configuration by a path leading only through safe configurations is equal to (or different from) a given x ∈ {0, 1}. We show that the qualitative extended reachability problem is decidable in polynomial time, and that the set of all configurations for which there is a winning strategy is effectively regular. More precisely, this set can be represented by a deterministic finite-state automaton with a fixed number of control states. This result generalizes a recent theorem by Etessami & Yannakakis which says that qualitative termination for 1-exit RMDPs (which exactly correspond to our 1½-player BPA games) is decidable in polynomial time. Interestingly, the properties of winning strategies for the extended reachability objectives are quite different from those for termination, and new observations are needed to obtain the result. As an application, we derive the EXPTIME-completeness of the model-checking problem for 1½-player BPA games and qualitative PCTL formulae.
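Spelled out, the extended reachability objective is a "safe until terminal" event over runs of configurations; the notation top(c) for the top-of-the-stack symbol of a configuration c is our assumption for this sketch.

```latex
% Runs that hit a terminal configuration while passing only through safe ones:
\mathcal{E}(S,T) \;=\; \bigl\{\, c_0 c_1 c_2 \cdots \;\bigm|\;
  \exists n \ge 0:\ \mathrm{top}(c_n) \in T \ \wedge\ \forall m < n:\ \mathrm{top}(c_m) \in S \,\bigr\}

% Qualitative extended reachability: decide whether some strategy \sigma achieves
\Pr{}^{\sigma}\bigl[\mathcal{E}(S,T)\bigr] \;=\; x \qquad (x \in \{0,1\})
```

With S the set of all configurations, this degenerates to the termination-style reachability objective of Etessami & Yannakakis, which is why the result is a strict generalization.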