2019 IEEE Congress on Evolutionary Computation (CEC)
DOI: 10.1109/cec.2019.8790141

Biasing MCTS with Features for General Games

Abstract: This paper proposes using a linear function approximator, rather than a deep neural network (DNN), to bias a Monte Carlo tree search (MCTS) player for general games. This is unlikely to match the potential raw playing strength of DNNs, but has advantages in terms of generality, interpretability and resources (time and hardware) required for training. Features describing local patterns are used as inputs. The features are formulated in such a way that they are easily interpretable and applicable to a wide range…

Cited by 10 publications (9 citation statements)
References 56 publications (147 reference statements)
“…When offline training is used to train policies, for instance based on deep neural networks [16] or simpler function approximators and state-action features [19], it is also customary to use such a distribution with τ = 1 (leading to a softmax distribution) and the Q(m) values referred to as logits. Let M denote a set of legal moves, and let I denote a set of moves as generated during a filter or no-repetition playout (which may include some illegal moves), such that M ⊆ I.…”
Section: Non-uniform Move Distributions
confidence: 99%
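A minimal sketch of the softmax move distribution described in the passage above, assuming per-move values Q(m) are already available as logits from a trained policy (function and move names are illustrative, not from the paper):

    import math

    def softmax_policy(logits, tau=1.0):
        # Convert per-move values Q(m) into a probability distribution over moves.
        # With temperature tau = 1 this reduces to a plain softmax over the logits.
        mx = max(logits.values())  # subtract the max for numerical stability
        exps = {m: math.exp((q - mx) / tau) for m, q in logits.items()}
        z = sum(exps.values())
        return {m: e / z for m, e in exps.items()}

    # Hypothetical logits for three legal moves:
    probs = softmax_policy({"a1": 1.2, "b2": 0.3, "c3": -0.5})

With tau = 1 the probabilities are proportional to exp(Q(m)), which is exactly the distribution the quote refers to.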
“…Policies use local patterns [26] as binary features for state-action pairs. We start every training run with a limited set of "atomic" features, and add one feature to every feature set after every full game of self-play [27]. Because we include asymmetric games, we use separate feature sets, separate experience buffers, and train separate feature weights, per player number (or colour).…”
Section: A Setup
confidence: 99%
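A rough, self-contained sketch of the setup quoted above, keeping a separate feature set, experience buffer, and weight vector per player, and appending one new feature after every game of self-play (all names and the feature-construction step are placeholders, not the authors' implementation):

    import random

    ATOMIC = ["f0", "f1", "f2"]  # stand-ins for the initial "atomic" features

    players = {
        colour: {"features": list(ATOMIC), "weights": [0.0] * len(ATOMIC), "buffer": []}
        for colour in ("player_1", "player_2")
    }

    def propose_feature(data):
        # Placeholder for the real construction, which combines existing features.
        a, b = random.sample(data["features"], 2)
        return f"({a}&{b})"

    for game_idx in range(5):  # pretend we play five games of self-play
        for colour, data in players.items():
            data["buffer"].append(("state", "action", game_idx))  # fake experience
            data["features"].append(propose_feature(data))        # one new feature per game
            data["weights"].append(0.0)                           # its weight starts untrained

Because the two players never share feature sets, buffers, or weights, the same structure also covers asymmetric games, as the quote notes.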
“…Note that the features that detect losing moves can be viewed as more "general" features, in the sense that they will also always be active in situations where the win-detecting feature is active. When the set of features is automatically grown over time during self-play, and more "specific" features are constructed by combining multiple more "general" features [29], the loss-detecting features are often discovered before the win-detecting features. These features are, as expected, quickly associated with negative weights, resulting in low probabilities π(s, a) ≈ 0 of playing actions a in which loss-detecting features are active.…”
Section: A Gradients For Low-probability Actions
confidence: 99%
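A small, self-contained illustration of the effect described above: once a loss-detecting binary feature has learned a strongly negative weight, the softmax probability of any action that activates it drops towards zero (the feature vectors and weights below are made up for the example):

    import math

    def action_probs(feature_vectors, weights):
        # Softmax over linear scores sum_i w_i * phi_i(s, a), one score per action.
        scores = [sum(w * f for w, f in zip(weights, phi)) for phi in feature_vectors]
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        z = sum(exps)
        return [e / z for e in exps]

    # Binary features per action: [atomic_1, atomic_2, loss_detecting]
    phi_a = [1, 0, 1]  # this action activates the loss-detecting feature
    phi_b = [0, 1, 0]
    phi_c = [1, 1, 0]
    weights = [0.2, 0.1, -8.0]  # loss-detecting feature with a large negative learned weight

    print(action_probs([phi_a, phi_b, phi_c], weights))
    # The first action's probability comes out close to 0, i.e. pi(s, a) ≈ 0 as in the quote.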
“…Updates are performed using a centered variant of RMSProp [32], with a base learning rate of 0.005, a momentum of 0.9, a discounting factor of 0.9, and a constant of 10^-8 added to the denominator for stability. After every full game of self-play, we add a new feature to the set of features [29].…”
Section: A Setup
confidence: 99%
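For reference, a minimal sketch of one centered RMSProp step with the hyperparameters quoted above (base learning rate 0.005, momentum 0.9, discounting factor 0.9, 10^-8 in the denominator); this follows the standard centered RMSProp formulation and is not necessarily identical to the authors' implementation of [32]:

    def centered_rmsprop_step(w, grad, state, lr=0.005, momentum=0.9, rho=0.9, eps=1e-8):
        # One in-place update of the weight vector w given its gradient.
        for i, g in enumerate(grad):
            state["sq"][i] = rho * state["sq"][i] + (1 - rho) * g * g  # running E[g^2]
            state["avg"][i] = rho * state["avg"][i] + (1 - rho) * g    # running E[g]
            var = state["sq"][i] - state["avg"][i] ** 2                # centered second moment
            state["mom"][i] = momentum * state["mom"][i] + lr * g / ((var + eps) ** 0.5)
            w[i] -= state["mom"][i]

    # Hypothetical usage with three feature weights:
    w = [0.0, 0.5, -0.2]
    state = {"sq": [0.0] * 3, "avg": [0.0] * 3, "mom": [0.0] * 3}
    centered_rmsprop_step(w, [0.1, -0.3, 0.05], state)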