2019 IEEE Conference on Games (CoG)
DOI: 10.1109/cig.2019.8848037

Learning Policies from Self-Play with Policy Gradients and MCTS Value Estimates

Abstract: In recent years, state-of-the-art game-playing agents often involve policies that are trained in self-play processes where Monte Carlo tree search (MCTS) algorithms and trained policies iteratively improve each other. The strongest results have been obtained when policies are trained to mimic the search behaviour of MCTS by minimising a cross-entropy loss. Because MCTS, by design, includes an element of exploration, policies trained in this manner are also likely to exhibit a similar extent of exploration. …
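
To make the contrast in the abstract concrete, the sketch below (not the authors' code; the network size, optimiser, and the cross_entropy_update / policy_gradient_update helpers are all hypothetical) shows (a) the standard cross-entropy loss towards an MCTS visit-count distribution and (b) a policy-gradient update driven by per-action MCTS value estimates, which is the direction indicated by the paper's title.

    import torch
    import torch.nn.functional as F

    feature_dim, num_actions = 8, 4                      # hypothetical sizes
    policy_net = torch.nn.Linear(feature_dim, num_actions)
    opt = torch.optim.SGD(policy_net.parameters(), lr=1e-2)

    def cross_entropy_update(features, visit_counts):
        # (a) Standard target: mimic MCTS by matching its visit-count distribution,
        # which also reproduces whatever exploration the search performed.
        target = visit_counts / visit_counts.sum()
        loss = -(target * F.log_softmax(policy_net(features), dim=-1)).sum()
        opt.zero_grad(); loss.backward(); opt.step()

    def policy_gradient_update(features, q_estimates):
        # (b) Sketch of a policy-gradient step whose returns are per-action value
        # estimates read from the MCTS search tree, pushing the policy towards
        # actions the search evaluates as strong rather than merely visits often.
        log_probs = F.log_softmax(policy_net(features), dim=-1)
        probs = log_probs.exp().detach()
        loss = -(probs * q_estimates * log_probs).sum()
        opt.zero_grad(); loss.backward(); opt.step()

    # Hypothetical outputs of a single MCTS call from self-play.
    features = torch.randn(feature_dim)
    cross_entropy_update(features, torch.tensor([40.0, 30.0, 20.0, 10.0]))
    policy_gradient_update(features, torch.tensor([0.6, 0.1, -0.2, -0.5]))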

Cited by 6 publications (11 citation statements)
References 25 publications (51 reference statements)
“…Consequently, a computationally heavy process is run just once (offline) and then this time-efficient problem representation can be used in subsequent online applications. The approach of combining an MCTS trainer with a fast learning-based representation can be hybridised in various ways, specific to the particular problem / domain of interest (Guo et al., 2014; Kartal et al., 2019a; Soemers et al., 2019). …”
Section: Discussion (citation type: mentioning)
confidence: 99%
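
As a purely illustrative sketch of the offline/online split this statement describes (toy heavy_planner and fast_policy stand-ins; not any cited paper's pipeline): the heavy process labels states once offline, and a cheap fitted representation answers online queries without search.

    # Toy stand-in for a computationally heavy planner; in the cited settings
    # this role is played by a full MCTS run per state.
    def heavy_planner(state):
        scores = [-(state - a) ** 2 for a in range(3)]
        return scores.index(max(scores))

    # Offline phase, run once: label a batch of states with the heavy planner.
    states = list(range(3))
    labels = {s: heavy_planner(s) for s in states}

    # Fast learning-based representation: here just a fitted lookup, standing in
    # for a neural network or other function approximator trained on `labels`.
    def fast_policy(state):
        return labels[state]

    # Online phase, run many times: no search, just a cheap query.
    assert all(fast_policy(s) == heavy_planner(s) for s in states)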
“…where P(m_i) is the output from the neural network trained on human data, C_BT is a weight controlling how the bias blends with the UCT score, and K is a parameter controlling the rate at which the bias decreases. Soemers et al. (2019) show that it is possible to learn a policy in an MDP using the policy gradient method and value estimates taken directly from the MCTS algorithm. Kartal et al. (2019a) propose a method to combine deep reinforcement learning and MCTS, where the latter acts as a demonstrator for the RL component. …”
Section: Mimicking Human Play (citation type: mentioning)
confidence: 99%
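
The formula itself is elided in the quote above. Purely as an illustration of how a learned prior P(m_i) can be blended into UCT selection with a bias that decays with visit count, the following sketch uses one common decay form; the constants, decay shape, and function name are assumptions, not the cited paper's exact definition.

    import math

    def biased_uct_score(q_sum, n_i, n_parent, p_i, c_uct=1.41, c_bt=0.5, k=100):
        # Standard UCT term (exploitation + exploration) for move m_i.
        if n_i == 0:
            return float("inf")
        uct = q_sum / n_i + c_uct * math.sqrt(math.log(n_parent) / n_i)
        # Prior bias from a learned policy P(m_i), weighted by C_BT and decaying
        # as the visit count n_i grows, with K controlling how fast it fades.
        # This particular decay form is an illustrative assumption only.
        bias = c_bt * p_i * k / (k + n_i)
        return uct + bias

    # Example: a move the prior likes vs. a slightly higher-valued move it does not.
    print(biased_uct_score(q_sum=6.0, n_i=10, n_parent=50, p_i=0.7))
    print(biased_uct_score(q_sum=7.0, n_i=10, n_parent=50, p_i=0.1))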
“…It is based on MCTS in which the simulation phase is replaced by a deep RL model that acts in the environment and chooses actions according to its policy rather than randomly. Soemers et al. (2019) show that it is possible to learn a policy in an MDP using the policy gradient method and value estimates taken directly from the MCTS algorithm. Kartal et al. (2019a) propose a method to combine deep RL and MCTS, where the latter acts as a demonstrator for the RL component. …”
Section: AlphaGo-Inspired Approaches (citation type: mentioning)
confidence: 99%
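
A minimal sketch of the substitution this statement describes, assuming a toy game and stand-in policies (none of this is the cited system's code): the MCTS simulation phase is a playout whose action-selection function can be either uniform-random or a trained policy.

    import random

    def rollout(state, step, is_terminal, reward, choose_action, max_depth=50):
        # Generic MCTS simulation phase: play out from `state` and return the
        # final reward. `choose_action` is uniform-random in classic MCTS; in the
        # hybrid described above it is a trained deep RL policy instead.
        depth = 0
        while not is_terminal(state) and depth < max_depth:
            state = step(state, choose_action(state))
            depth += 1
        return reward(state)

    # Toy game (assumption): walk on 0..10, terminal at either end, win at 10.
    actions = (-1, +1)
    random_policy = lambda s: random.choice(actions)   # classic random playout
    learned_policy = lambda s: +1                      # stand-in for an RL policy
    value = rollout(5,
                    step=lambda s, a: s + a,
                    is_terminal=lambda s: s in (0, 10),
                    reward=lambda s: 1.0 if s == 10 else 0.0,
                    choose_action=learned_policy)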
“…Therefore, the first 10 moves will contain MCTS's exploration, and the rest will feature only the most-visited action. We note that there exists research in this area, with a focus on removing exploration elements from MCTS policy targets with the hope of aiding interpretability [32]. …”
Section: Hyperparameters and General Performance Improvements (citation type: mentioning)
confidence: 99%
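
As a concrete illustration of the scheme in this statement (hypothetical helper and data; not the cited paper's implementation), the sketch below samples moves in proportion to MCTS visit counts for the first 10 moves and plays only the most-visited move afterwards.

    import random

    def select_move(visit_counts, move_number, exploration_moves=10):
        # For the first `exploration_moves` moves, sample in proportion to MCTS
        # visit counts, so the played move keeps the search's exploration; after
        # that, always play the most-visited move. The cut-off of 10 follows the
        # quoted setup; the rest of this helper is an illustrative assumption.
        moves = list(visit_counts)
        if move_number < exploration_moves:
            weights = [visit_counts[m] for m in moves]
            return random.choices(moves, weights=weights)[0]
        return max(moves, key=lambda m: visit_counts[m])

    counts = {"a": 40, "b": 30, "c": 20, "d": 10}
    print(select_move(counts, move_number=3))    # stochastic, proportional to visits
    print(select_move(counts, move_number=30))   # deterministic: "a"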