2020
DOI: 10.1145/3379476
Non-Asymptotic Analysis of Monte Carlo Tree Search

Abstract: In this work, we consider a popular tree-based search strategy within the framework of reinforcement learning, Monte Carlo Tree Search (MCTS), in the context of an infinite-horizon discounted-cost Markov Decision Process (MDP) with deterministic transitions. While MCTS is believed to provide an approximate value function for a given state with enough simulations, cf. [5,6], the claimed proof of this property is incomplete. This is due to the fact that the variant of MCTS, the Upper Confidence Bound for Tree…
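The setting the abstract describes — estimating the discounted cost-to-go of a state by simulation — can be sketched minimally as a random rollout on a deterministic MDP. The `step` and `cost` interfaces, the action set, and the discount factor below are illustrative assumptions, not taken from the paper:

```python
import random

GAMMA = 0.9  # discount factor (illustrative choice)

def rollout_value(state, step, cost, horizon=50):
    """Estimate the discounted cost-to-go of `state` with one random rollout.

    `step(state, action) -> next_state` is a deterministic transition and
    `cost(state, action) -> float` a per-step cost; both are assumed
    interfaces for this sketch. A truncated horizon stands in for the
    infinite-horizon sum.
    """
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        action = random.choice([0, 1])        # illustrative two-action set
        total += discount * cost(state, action)
        discount *= GAMMA
        state = step(state, action)
    return total
```

MCTS refines such crude rollout estimates by growing a search tree and reusing statistics across simulations; the paper's point is that the rate at which those estimates concentrate is what needs a careful, non-asymptotic argument.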

Cited by 8 publications (14 citation statements) · References 0 publications
“…When the base algorithm has convergence guarantees, such as UCRL, we can additionally provide guarantees on the rate of convergence. We provide these rates and a discussion of the UCRL case in the appendix: our analysis draws upon the analysis of convergence rates for Monte Carlo Tree Search from Shah et al (Shah, Xie, and Xu 2020).…”
Section: Brief Theoretical Discussion (mentioning)
Confidence: 99%
“…In the case where the learning algorithms under constraints have regret guarantees, such as UCRL, we can closely follow the techniques of Shah et al (Shah, Xie, and Xu 2020) to provide a concentration property. In this subsection we will first show this generally and then provide discussion for UCRL.…”
Section: Convergence When the Algorithms Have Guarantees on the Rates… (mentioning)
Confidence: 99%
“…|I| k ] that maps joint observation, z, to joint action, a. Adapting the proof concept in [6] to our setting, the policy improvement can be shown by validating the following two properties and then iterating: (11) and, after generating an appropriate dataset, (ii) learning…”
Section: A Meta Self-Improving Algorithm (mentioning)
Confidence: 99%
“…MCTS frequently uses the Upper Confidence Bound for Trees algorithm [4] that uses a discrete-action, multi-armed bandit solution [5] to balance exploration and exploitation in node selection. Recent work uses a non-stationary bandit analysis to propose a polynomial, rather than logarithmic, exploration term [6].…”
Section: Introduction (mentioning)
Confidence: 99%
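The exploration-term contrast this citation draws can be made concrete with a small sketch. The logarithmic bonus is the classical UCB1/UCT form; the polynomial bonus below only illustrates the *shape* of the alternative — its exponents are placeholders, not the constants derived in the cited analysis. Selection follows the cost-minimization convention of the discounted-cost MDP setting:

```python
import math

def log_bonus(t, n, c=2.0):
    """Classical logarithmic UCB exploration term used in UCT."""
    return c * math.sqrt(math.log(t) / n)

def poly_bonus(t, n, alpha=0.5, beta=1.0):
    """Polynomial exploration term; alpha and beta are illustrative
    placeholders, not the exponents from the non-stationary analysis."""
    return beta * (t ** alpha) / n

def select_action(stats, t, bonus):
    """Pick the action minimizing empirical mean cost minus the bonus.

    `stats` maps action -> (visit count n, accumulated cost total).
    Unvisited actions are tried first.
    """
    def score(item):
        _, (n, total) = item
        return -math.inf if n == 0 else total / n - bonus(t, n)
    return min(stats.items(), key=score)[0]
```

Swapping `log_bonus` for `poly_bonus` changes only how aggressively under-visited children keep being revisited as the total visit count `t` grows, which is exactly the knob the non-stationary bandit analysis adjusts.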