2019
DOI: 10.1609/aaai.v33i01.33013494

How to Combine Tree-Search Methods in Reinforcement Learning

Abstract: Finite-horizon lookahead policies are abundantly used in Reinforcement Learning and demonstrate impressive empirical success. Usually, the lookahead policies are implemented with specific planning methods such as Monte Carlo Tree Search (e.g. in AlphaZero (Silver et al. 2017b)). Referring to the planning problem as tree search, a reasonable practice in these implementations is to back up the value only at the leaves while the information obtained at the root is not leveraged other than for updating the policy.…
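To make the tree-search terminology concrete, below is a minimal sketch (not the paper's algorithm) of an h-step lookahead policy over a known model, in which the current value estimate V is backed up only at the leaves of the search tree. The `mdp.actions` / `mdp.transitions` interface, the tabular V, and the function names are hypothetical placeholders, not an API from the paper.

```python
def lookahead_value(mdp, V, s, depth, gamma=0.99):
    """Optimal depth-step lookahead value of state s, bootstrapping with the
    estimate V only at the leaves of the (exhaustively expanded) search tree.

    Hypothetical interface: mdp.actions(s) -> iterable of actions,
    mdp.transitions(s, a) -> iterable of (prob, next_state, reward) tuples.
    """
    if depth == 0:
        return V[s]  # leaf node: back up the current value estimate
    return max(
        sum(p * (r + gamma * lookahead_value(mdp, V, s2, depth - 1, gamma))
            for p, s2, r in mdp.transitions(s, a))
        for a in mdp.actions(s)
    )


def h_step_lookahead_policy(mdp, V, s, h, gamma=0.99):
    """Greedy root action of an h-step lookahead (tree-search) policy.

    Returns both the chosen action and the h-step value computed at the root;
    the abstract's point concerns whether this root quantity is leveraged
    beyond merely selecting the action.
    """
    q = {a: sum(p * (r + gamma * lookahead_value(mdp, V, s2, h - 1, gamma))
                for p, s2, r in mdp.transitions(s, a))
         for a in mdp.actions(s)}
    best_action = max(q, key=q.get)
    return best_action, q[best_action]
```

In a planning-and-learning loop, the root value returned here could also serve as a training target for V, which, as the abstract notes, is typically not done when only the leaf values are backed up.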

Cited by 19 publications (36 citation statements) · References 9 publications
“…Our experiment is even more significant considering that Efroni et al. [14] recently proved that the Bellman update should be replaced so that contraction is guaranteed for tree-based policies only when the value at the leaves is backed up. However, this theory was not supported by empirical evidence beyond a toy maze.…”
Section: Training With Tree Search
confidence: 99%
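For context, the contraction property the quoted statement refers to can be stated generically; the following is the textbook fact about the h-step optimal Bellman operator, not a restatement of the specific result in [14].

```latex
% One-step optimal Bellman operator T and its h-fold composition T^h
(TV)(s) = \max_{a}\Big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big],
\qquad T^{h} V = \underbrace{T \circ \cdots \circ T}_{h\ \text{times}}\, V .
% Since T is a gamma-contraction in the sup-norm, the h-step backup contracts at rate gamma^h
\lVert T^{h} V - T^{h} U \rVert_{\infty} \le \gamma^{h}\, \lVert V - U \rVert_{\infty}.
```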
“…We find this method to be beneficial in several of the games we tested. In the experiments below, we treat the correction from [14] as a hyper-parameter and include ablation studies of it in Appendix C.3.…”
Section: Training With Tree Search
confidence: 99%
“…• Multi-step approximate dynamic programming: More complex integrations use a form of multi-step approximate dynamic programming (Efroni et al., 2018, 2019).…”
Section: Model-based Reinforcement Learning
confidence: 99%
“…Several recent works rigorously analyzed the properties of multi-step lookahead in common RL schemes (Efroni et al., 2018a,b, 2019, 2020; Hallak et al., 2021). This and other related literature studied a fixed planning horizon chosen in advance.…”
Section: Introduction
confidence: 99%