2021
DOI: 10.48550/arxiv.2106.07046
Preprint

Towards Tight Bounds on the Sample Complexity of Average-reward MDPs

Abstract: We prove new upper and lower bounds on the sample complexity of finding an $\epsilon$-optimal policy of an infinite-horizon average-reward Markov decision process (MDP) given access to a generative model. When the mixing time of the probability transition matrix of all policies is at most $t_{\mathrm{mix}}$, we provide an algorithm that solves the problem using $\widetilde{O}(t_{\mathrm{mix}} \epsilon^{-3})$ (oblivious) samples per state-action pair. Further, we provide a lower bound showing that a linear dependence on $t_{\mathrm{mix}}$ is necessary in the worst case for any algorithm…
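To make the abstract's setting concrete, below is a minimal sketch of generative-model access with an oblivious (fixed, identical) per-pair sample budget: each state-action pair receives the same number of i.i.d. next-state draws, from which an empirical transition model is built. This is an illustration of the access model only, not the paper's algorithm; the sampler interface, the function name `empirical_model`, and the toy MDP are all assumptions made for this example.

```python
import numpy as np

def empirical_model(sample_next_state, n_states, n_actions, samples_per_pair):
    """Build an empirical transition model from a generative model.

    `sample_next_state(s, a)` is assumed to return one i.i.d. next state
    for the state-action pair (s, a). The fixed per-pair budget mirrors
    the 'oblivious' sampling pattern described in the abstract.
    """
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(samples_per_pair):
                P_hat[s, a, sample_next_state(s, a)] += 1
    return P_hat / samples_per_pair  # normalize counts to probabilities

# Hypothetical usage: a toy 2-state, 2-action MDP plays the generative model.
rng = np.random.default_rng(0)
P_true = np.array([[[0.9, 0.1], [0.2, 0.8]],
                   [[0.5, 0.5], [0.7, 0.3]]])
sampler = lambda s, a: rng.choice(2, p=P_true[s, a])
P_hat = empirical_model(sampler, n_states=2, n_actions=2, samples_per_pair=1000)
```

The estimated model `P_hat` could then be handed to any tabular planner; the paper's contribution is the bound on how large `samples_per_pair` must be (in terms of $t_{\mathrm{mix}}$ and $\epsilon$) for the resulting policy to be $\epsilon$-optimal.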

Cited by 1 publication (1 citation statement)
References 19 publications
“…Another popular performance measure is the sample complexity, which is the amount of data required to learn a near-optimal policy; see, e.g., Brunskill and Li (2014), Jin and Sidford (2021), Wang (2017)…”
mentioning
confidence: 99%