2006 IEEE Symposium on Computational Intelligence and Games
DOI: 10.1109/cig.2006.311681

Temporal Difference Learning Versus Co-Evolution for Acquiring Othello Position Evaluation

Abstract: This paper compares the use of temporal difference learning (TDL) versus co-evolutionary learning (CEL) for acquiring position evaluation functions for the game of Othello. The paper provides important insights into the strengths and weaknesses of each approach. The main findings are that for Othello, TDL learns much faster than CEL, but that properly tuned CEL can learn better playing strategies. For CEL, it is essential to use parent-child weighted averaging in order to achieve good performance. Usi…
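To connect the abstract to concrete update rules, here is a minimal, hypothetical sketch (not the authors' code) of a TD(0) update for a linear Othello position evaluator, together with the parent-child weighted averaging step the abstract highlights for CEL. The tanh squashing, board encoding, learning rate, and mixing weight beta are assumptions, not details taken from the paper.

```python
import numpy as np

N_SQUARES = 64  # 8 x 8 Othello board

def evaluate(weights, board):
    """Position value in (-1, 1): tanh of a weighted sum over squares.

    `board` is a length-64 vector with +1 (own disc), -1 (opponent), 0 (empty).
    """
    return np.tanh(weights @ board)

def td0_update(weights, board, next_board, reward, alpha=0.01, gamma=1.0):
    """One TD(0) step: move V(s) toward r + gamma * V(s').

    The gradient of tanh(w.x) with respect to w is (1 - V^2) * x.
    """
    v = evaluate(weights, board)
    v_next = evaluate(weights, next_board)
    delta = reward + gamma * v_next - v
    weights += alpha * delta * (1.0 - v ** 2) * board
    return weights

def parent_child_average(parent, child, beta=0.95):
    """Parent-child weighted averaging for CEL: the new parent is a
    weighted average of the old parent and its child, damping the
    noisy win/loss fitness signal."""
    return beta * parent + (1.0 - beta) * child
```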

Cited by 46 publications (81 citation statements) · References 8 publications
“…Othello), which is a deterministic, perfect information, zero-sum game for two players, has been studied by the AI community [11,12,20,24,25,32]. The game's goal is to control a majority of the pieces at the end of the game by forcing as many of your opponent's pieces to be turned over on an 8 × 8 board as possible.…”
Section: Reversi (mentioning)
confidence: 99%
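As a concrete illustration of the disc-majority goal described in the excerpt above, here is a minimal sketch (an assumption for illustration, not code from the citing paper) of determining the winner of a finished game from a board encoded with +1/-1/0 entries:

```python
def game_result(board):
    """Winner by disc majority at the end of an Othello/Reversi game.

    `board` is an iterable of 64 entries: +1 (black disc), -1 (white disc),
    0 (empty). Returns +1 if black holds the majority, -1 if white does,
    and 0 for a draw.
    """
    margin = sum(board)
    return (margin > 0) - (margin < 0)
```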
“…Noise is added to the evaluation because we would like to collect a variety of game trajectories. The weight w_i of HEUR is determined manually, while that of COEV is optimized by a co-evolutionary computation method [11]. Every policy repeatedly played against every other, and the state transitions were then retrieved from the game trajectories of the winners.…”
Section: Reversi (mentioning)
confidence: 99%
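The protocol quoted above, a weighted heuristic whose score is perturbed by noise so that repeated games between the same policies yield varied trajectories, might look like the following sketch. The feature functions, noise distribution, and noise scale are assumptions rather than details from the citing paper.

```python
import random

def noisy_evaluate(weights, features, board, sigma=0.1):
    """Perturbed linear evaluation: sum_i w_i * f_i(board) + Gaussian noise.

    Hypothetical sketch: the noise term makes otherwise deterministic
    policies produce varied game trajectories when they repeatedly play
    one another, as described in the quoted passage.
    """
    score = sum(w * f(board) for w, f in zip(weights, features))
    return score + random.gauss(0.0, sigma)
```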
“…However, they find at least one setup, using coevolution, wherein evolution outperforms TD. They also present results for Othello [38], finding that TD methods are much faster but that a properly tuned evolutionary method ultimately performs best. Lucas and Togelius [39] present similar comparative results in a simple car-racing domain.…”
Section: Related Work (mentioning)
confidence: 99%
“…Fortunately, the game rules are flexible enough to be easily adapted to smaller boards without loss of the underlying 'spirit' of the game, so in a great part of studies on computer Go the board is downgraded to 9 × 9 or 5 × 5. Following Lucas and Runarsson (2006) as well as Lubberts and Miikkulainen (2001), we consider playing Go on a 5 × 5 board (see Fig. 1).…”
Section: Adopted Computer Go Rules (mentioning)
confidence: 99%