In this paper we provide faster algorithms for approximately solving discounted Markov decision processes in multiple parameter regimes. Given a discounted Markov decision process (DMDP) with |S| states, |A| actions, discount factor γ ∈ (0, 1), and rewards in the range [−M, M], we show how to compute an ϵ-optimal policy, with probability 1 − δ, in time (Note: We use $\tilde{O}$ to hide polylogarithmic factors in the input parameters, that is, $\tilde{O}(f(x)) = O(f(x) \cdot \log(f(x))^{O(1)})$.)

$$ \tilde{O}\left(\left(|S|^2|A| + \frac{|S||A|}{(1-\gamma)^3}\right)\log\left(\frac{M}{\epsilon}\right)\log\left(\frac{1}{\delta}\right)\right). $$

This contribution reflects the first nearly linear time, nearly linearly convergent algorithm for solving DMDPs for intermediate values of γ. We also show how to obtain improved sublinear time algorithms provided we can sample from the transition function in O(1) time. Under this assumption we provide an algorithm which computes an ϵ-optimal policy for $\epsilon \in \left(0, \frac{M}{\sqrt{1-\gamma}}\right]$ with probability 1 − δ in time

$$ \tilde{O}\left(\frac{|S||A|M^2}{(1-\gamma)^4\epsilon^2}\log\left(\frac{1}{\delta}\right)\right). $$

Furthermore, we extend both of these algorithms to solve finite horizon MDPs. Our algorithms improve upon the previous best for approximately computing optimal policies for fixed-horizon MDPs in multiple parameter regimes. Interestingly, we obtain our results by a careful modification of approximate value iteration. We show how to combine classic approximate value iteration analysis with new techniques in variance reduction. Our fastest algorithms leverage further insights to ensure that our algorithms make monotonic progress towards the optimal value. This paper is one of the few instances of using sampling to obtain a linearly convergent linear programming algorithm, and we hope that the analysis may be useful more broadly.
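The variance-reduction idea described in the abstract can be illustrated with a small sketch. The code below is a minimal, assumed illustration of sampled value iteration with a variance-reduction anchor, not the authors' exact algorithm; the function name, sample budgets, and the toy random DMDP at the end are hypothetical choices for illustration. Each epoch spends many samples once on the anchor term, and the cheap inner iterations only sample the correction term, whose variance shrinks as the iterate approaches the anchor.

```python
import numpy as np

def variance_reduced_avi(P, r, gamma, epochs=4, inner_iters=30,
                         n_anchor=2000, n_inner=32, seed=0):
    """Sketch of sampled value iteration with a variance-reduction anchor.

    P: transition tensor, shape (S, A, S); r: rewards, shape (S, A).
    Each epoch estimates P @ v_anchor once with many samples; the inner
    iterations only sample the correction P @ (v - v_anchor), whose
    magnitude (and hence variance) shrinks as v approaches v_anchor.
    """
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    v = np.zeros(S)
    q = np.zeros((S, A))
    for _ in range(epochs):
        v_anchor = v.copy()
        anchor = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                idx = rng.choice(S, size=n_anchor, p=P[s, a])
                anchor[s, a] = v_anchor[idx].mean()  # accurate estimate of (P v_anchor)(s, a)
        for _ in range(inner_iters):
            diff = v - v_anchor
            for s in range(S):
                for a in range(A):
                    idx = rng.choice(S, size=n_inner, p=P[s, a])
                    # Approximate Bellman backup: anchor term plus sampled correction.
                    q[s, a] = r[s, a] + gamma * (anchor[s, a] + diff[idx].mean())
            v = q.max(axis=1)
    return v, q.argmax(axis=1)  # approximate values and the greedy policy


# Smoke test on a small random DMDP (sizes and rewards chosen only for illustration).
S, A, gamma = 5, 3, 0.9
rng = np.random.default_rng(1)
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.uniform(-1.0, 1.0, size=(S, A))
values, policy = variance_reduced_avi(P, r, gamma)
```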
Nitrogen dioxide (NO2) not only is linked to adverse effects on the respiratory system but also contributes to the formation of ground-level ozone (O3) and fine particulate matter (PM2.5). Our curbside monitoring data analysis in Detroit, MI, and Atlanta, GA, strongly suggests that a large fraction of NO2 is produced during the "tailpipe-to-road" stage. To substantiate this finding, we designed and carried out a field campaign to measure the same exhaust plumes at the tailpipe level with a portable emissions measurement system (PEMS) and at the on-road level with an electric vehicle-based mobile platform. Furthermore, we employed a turbulent reacting flow model, CTAG, to simulate the on-road chemistry behind a single vehicle. We found that a three-reaction (NO-NO2-O3) system can largely capture the rapid NO to NO2 conversion (on a time scale of seconds) observed in the field studies. To distinguish the contributions from different mechanisms to near-road NO2, we clearly defined a set of NO2/NOx ratios at different plume evolution stages, namely tailpipe, on-road, curbside, near-road, and ambient background. Our findings from curbside monitoring, on-road experiments, and simulations imply that the on-road oxidation of NO by ambient O3 is a significant, but so far ignored, contributor to curbside and near-road NO2.
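As a rough illustration of the three-reaction (NO-NO2-O3) system mentioned above, the sketch below integrates NO titration by ambient O3 against NO2 photolysis with a simple forward-Euler loop. The rate constants, photolysis rate, unit conversion, and initial concentrations are assumed round numbers for illustration only, not values from the study.

```python
# Illustrative forward-Euler integration of the three-reaction NO-NO2-O3 system.
K_NO_O3 = 1.8e-14   # NO + O3 -> NO2 + O2, cm^3 molecule^-1 s^-1 (approximate, ~298 K)
J_NO2 = 8.0e-3      # NO2 + hv -> NO + O, followed by O + O2 -> O3; effective rate, s^-1
PPB = 2.46e10       # molecules cm^-3 per ppb near surface conditions (approximate)

def simulate_plume(no_ppb, no2_ppb, o3_ppb, dt=0.01, t_end=30.0):
    """Evolve plume-averaged NO, NO2, O3 (ppb) over t_end seconds."""
    no, no2, o3 = no_ppb * PPB, no2_ppb * PPB, o3_ppb * PPB
    for _ in range(int(t_end / dt)):
        titration = K_NO_O3 * no * o3   # converts NO and O3 into NO2
        photolysis = J_NO2 * no2        # regenerates NO and O3 from NO2
        no += dt * (photolysis - titration)
        no2 += dt * (titration - photolysis)
        o3 += dt * (photolysis - titration)
    return no / PPB, no2 / PPB, o3 / PPB

# Fresh NO-rich exhaust diluted into background air with ~40 ppb O3: most of the
# available O3 is titrated within tens of seconds, raising the NO2 fraction.
print(simulate_plume(no_ppb=200.0, no2_ppb=20.0, o3_ppb=40.0))
```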