Biasing MCTS with Features for General Games

Soemers, Dennis J. N. J.; Piette, Éric; Browne, Cameron

doi:10.1109/cec.2019.8790141

Cited by 10 publications

(9 citation statements)

References 56 publications

(147 reference statements)

“…When offline training is used to train policies, for instance based on deep neural networks [16] or simpler function approximators and state-action features [19], it is also customary to use such a distribution with τ = 1 (leading to a softmax distribution) and the Q(m) values referred to as logits. Let M denote a set of legal moves, and let I denote a set of moves as generated during a filter or no-repetition playout (which may include some illegal moves), such that M ⊆ I.…”

Section: Non-uniform Move Distributions (mentioning)

confidence: 99%
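
The distribution described in the citation statement above can be illustrated with a short sketch. It assumes the usual Boltzmann form, in which each move m is sampled with probability proportional to exp(Q(m) / τ); the function name `boltzmann_distribution` and the list-based input are hypothetical and are not taken from the cited paper or from Ludii.

```python
import math


def boltzmann_distribution(q_values, tau=1.0):
    """Return probabilities proportional to exp(q / tau) for each q in q_values.

    With tau = 1 this is the ordinary softmax, and the q-values play the role
    of logits. Assumption: tau must be strictly positive.
    """
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    scaled = [q / tau for q in q_values]
    # Subtracting the maximum avoids overflow and leaves the result unchanged.
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Larger values of τ flatten the distribution towards uniform, while values close to zero concentrate the probability mass on the highest-valued move.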

Optimised Playout Implementations for the Ludii General Game System

Soemers¹,

Piette²,

Stephenson³

et al. 2021

Preprint

Self Cite

This paper describes three different optimised implementations of playouts, as commonly used by game-playing algorithms such as Monte-Carlo Tree Search. Each of the optimised implementations is applicable only to specific sets of games, based on their rules. The Ludii general game system can automatically infer, based on a game's description in its general game description language, whether any optimised implementations are applicable. An empirical evaluation demonstrates major speedups over a standard implementation, with a median result of running playouts 5.08 times as fast, over 145 different games in Ludii for which one of the optimised implementations is applicable.

“…Policies use local patterns [26] as binary features for state-action pairs. We start every training run with a limited set of "atomic" features, and add one feature to every feature set after every full game of self-play [27]. Because we include asymmetric games, we use separate feature sets, separate experience buffers, and train separate feature weights, per player number (or colour).…”

Section: A Setup (mentioning)

confidence: 99%

Manipulating the Distributions of Experience used for Self-Play Learning in Expert Iteration

Soemers

Piette

Stephenson

et al. 2020

2020 IEEE Conference on Games (CoG)

Self Cite

Expert Iteration (ExIt) is an effective framework for learning game-playing policies from self-play. ExIt involves training a policy to mimic the search behaviour of a tree search algorithm, such as Monte-Carlo tree search, and using the trained policy to guide it. The policy and the tree search can then iteratively improve each other, through experience gathered in self-play between instances of the guided tree search algorithm. This paper outlines three different approaches for manipulating the distribution of data collected from self-play, and the procedure that samples batches for learning updates from the collected data. Firstly, samples in batches are weighted based on the durations of the episodes in which they were originally experienced. Secondly, Prioritized Experience Replay is applied within the ExIt framework, to prioritise sampling experience from which we expect to obtain valuable training signals. Thirdly, a trained exploratory policy is used to diversify the trajectories experienced in self-play. This paper summarises the effects of these manipulations on training performance evaluated in fourteen different board games. We find major improvements in early training performance in some games, and minor improvements averaged over fourteen games.

“…Note that the features that detect losing moves can be viewed as more "general" features, in the sense that they will also always be active in situations where the win-detecting feature is active. When the set of features is automatically grown over time during self-play, and more "specific" features are constructed by combining multiple more "general" features [29], the loss-detecting features are often discovered before the win-detecting features. These features are, as expected, quickly associated with negative weights, resulting in low probabilities π(s, a) ≈ 0 of playing actions a in which loss-detecting features are active.…”

Section: A Gradients For Low-probability Actions (mentioning)

confidence: 99%

“…Updates are performed using a centered variant of RMSProp [32], with a base learning rate of 0.005, a momentum of 0.9, a discounting factor of 0.9, and a constant of 10⁻⁸ added to the denominator for stability. After every full game of self-play, we add a new feature to the set of features [29].…”

Section: A Setup (mentioning)

confidence: 99%
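
The optimiser settings quoted above can be made concrete with a minimal sketch of one centered RMSProp step. It follows the commonly used formulation with running averages of the gradient and of the squared gradient. The following are assumptions rather than details confirmed by the citing paper: the function and key names, the use of plain Python lists, the descent (minimisation) sign convention, and adding the stability constant outside the square root.

```python
import math


def new_optimiser_state(num_weights):
    """Create zero-initialised running averages (hypothetical helper)."""
    return {
        "mean_sq": [0.0] * num_weights,   # running average of g^2
        "mean": [0.0] * num_weights,      # running average of g
        "velocity": [0.0] * num_weights,  # momentum buffer
    }


def centered_rmsprop_step(weights, grads, state, lr=0.005, momentum=0.9,
                          decay=0.9, eps=1e-8):
    """Apply one in-place descent step and return the updated weights."""
    for i, g in enumerate(grads):
        state["mean_sq"][i] = decay * state["mean_sq"][i] + (1.0 - decay) * g * g
        state["mean"][i] = decay * state["mean"][i] + (1.0 - decay) * g
        # "Centered": normalise by an estimate of the gradient variance.
        # The max() guards against tiny negative values from rounding.
        variance = max(state["mean_sq"][i] - state["mean"][i] ** 2, 0.0)
        state["velocity"][i] = (momentum * state["velocity"][i]
                                - lr * g / (math.sqrt(variance) + eps))
        weights[i] += state["velocity"][i]
    return weights
```

The defaults mirror the quoted values: a base learning rate of 0.005, a momentum of 0.9, a discounting factor of 0.9 (here `decay`), and a constant of 10⁻⁸ (here `eps`). For an objective that is maximised rather than minimised, the sign of the gradient term would be flipped.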

“…Probabilities π(s, a) are subsequently computed using the softmax function; π(s, a) = exp(z(s, a)) / Σ_{a′} exp(z(s, a′)). In preliminary testing, we found that there is a risk for strong features that are only discovered and added in the middle of a self-play training process [29] to remain unused. When this happens, it appears like the learning approach remains stuck in what used to be a local optimum given an older feature set, even though newly-added features should enable escaping that local optimum.…”

Section: Learning Offsets From Exploratory Policy (mentioning)

confidence: 99%
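
The citation statements from this paper describe a softmax policy over logits z(s, a), and note that loss-detecting features with negative weights drive π(s, a) towards 0. The sketch below illustrates both points. It assumes that z(s, a) is the sum of the weights of the binary features active for the state-action pair, which is the usual linear form but is not spelled out in the quoted text; the function name and the dictionary-based input format are hypothetical.

```python
import math


def linear_softmax_policy(active_features, weights):
    """Map each action to its softmax probability.

    active_features: dict from action to the set of indices of binary
        features that are active for that action in the current state.
    weights: list with one learned weight per feature index.
    """
    # Assumed linear logits: z(s, a) = sum of weights of active features.
    logits = {action: sum(weights[i] for i in features)
              for action, features in active_features.items()}
    highest = max(logits.values())
    exps = {action: math.exp(z - highest) for action, z in logits.items()}
    total = sum(exps.values())
    return {action: e / total for action, e in exps.items()}
```

For example, with weights `[2.0, -10.0]`, where feature 1 stands in for a loss-detecting feature, an action on which both features are active receives a much lower probability than an action on which only feature 0 is active.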

Learning Policies from Self-Play with Policy Gradients and MCTS Value Estimates

Soemers

Piette

Stephenson

et al. 2019

2019 IEEE Conference on Games (CoG)

Self Cite

In recent years, state-of-the-art game-playing agents often involve policies that are trained in self-playing processes where Monte Carlo tree search (MCTS) algorithms and trained policies iteratively improve each other. The strongest results have been obtained when policies are trained to mimic the search behaviour of MCTS by minimising a cross-entropy loss. Because MCTS, by design, includes an element of exploration, policies trained in this manner are also likely to exhibit a similar extent of exploration. In this paper, we are interested in learning policies for a project with future goals including the extraction of interpretable strategies, rather than state-of-the-art game-playing performance. For these goals, we argue that such an extent of exploration is undesirable, and we propose a novel objective function for training policies that are not exploratory. We derive a policy gradient expression for maximising this objective function, which can be estimated using MCTS value estimates, rather than MCTS visit counts. We empirically evaluate various properties of resulting policies, in a variety of board games.
