2010
DOI: 10.1609/aaai.v24i1.7689

Integrating Sample-Based Planning and Model-Based Reinforcement Learning

Abstract: Recent advancements in model-based reinforcement learning have shown that the dynamics of many structured domains (e.g. DBNs) can be learned with tractable sample complexity, despite their exponentially large state spaces. Unfortunately, these algorithms all require access to a planner that computes a near optimal policy, and while many traditional MDP algorithms make this guarantee, their computation time grows with the number of states. We show how to replace these over-matched planners with a class of samp…


Cited by 55 publications (16 citation statements)
References 15 publications
“…Proof sketch. The proof is identical to that of Proposition 3 by Walsh et al. (Walsh, Goschin, and Littman 2010), by noting that every trial of FSSS-Aux has to end at a node with zero bound gap. If such a node is not a leaf, it would not be selected in the first place, because it cannot be the node with the widest bound gap.…”
Section: FSSS-Aux: FSSS with π-Guided Auxiliary Arms
confidence: 63%
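
To make the bound-gap argument concrete, here is a minimal sketch (with hypothetical node fields, not the authors' code) of the widest-gap descent rule the proof relies on: a non-leaf node with zero bound gap cannot be the argmax while a sibling still has a positive gap, so a trial can only stop where the gap has collapsed.

```python
# Minimal sketch of the widest-bound-gap descent the proof sketch relies on.
# Node fields (`lower`, `upper`, `children`) are assumptions for illustration:
# each node carries interval bounds [lower, upper] on its value estimate, and
# `children` is a list of successor nodes.

def widest_gap(nodes):
    """Select the successor with the largest bound gap upper - lower."""
    return max(nodes, key=lambda n: n.upper - n.lower)

def descend(node):
    """Follow widest-gap successors until the gap collapses to zero."""
    while node.children and node.upper > node.lower:
        node = widest_gap(node.children)
    return node  # trial terminus: an unexpanded leaf or a zero-gap node
```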
“…FSSS is the latest successor of SS that is able to combine the performance guarantee of SS with selective sampling (Walsh, Goschin, and Littman 2010). Instead of directly forming point estimates of the Q- and V-values in a look-ahead tree, it maintains interval estimates that then allow it to guide the sampling toward promising and/or unexplored branches of the look-ahead tree, based on the widths and upper bounds of the intervals.…”
Section: Forward Search Sparse Sampling (FSSS)
confidence: 99%
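
As a concrete illustration of the interval estimates described above, here is a hedged sketch under stated assumptions: a generative model `sim(s, a) -> (next_state, reward)`, rewards in [0, 1], and a fixed search horizon. It mirrors the FSSS idea of descending on upper bounds and bound-gap widths, but all names and structure are invented for the example, not taken from the paper.

```python
# Hedged sketch of FSSS-style interval bookkeeping in a look-ahead tree.
# Assumptions (not from the paper): sim(s, a) -> (next_state, reward),
# rewards in [0, 1], a fixed horizon; all names are illustrative.

GAMMA = 0.95
V_MIN, V_MAX = 0.0, 1.0 / (1.0 - GAMMA)     # a priori bounds on values

class Node:
    def __init__(self, state, depth):
        self.state, self.depth = state, depth
        self.lower, self.upper = V_MIN, V_MAX   # interval estimate of V(state)
        self.children = {}                      # action -> list of (reward, Node)

def fsss_trial(node, sim, actions, width, horizon):
    """One trial: descend on upper bounds and bound gaps, then back up."""
    if node.depth == horizon:                   # leaf: no future reward counted
        node.lower = node.upper = 0.0
        return
    if not node.children:                       # expand: `width` samples per action
        for a in actions:
            node.children[a] = [(r, Node(s2, node.depth + 1))
                                for s2, r in [sim(node.state, a)
                                              for _ in range(width)]]

    def q(a, bound):                            # sampled Bellman backup of a bound
        kids = node.children[a]
        return sum(r + GAMMA * getattr(c, bound) for r, c in kids) / len(kids)

    best = max(actions, key=lambda a: q(a, "upper"))          # optimistic action
    _, child = max(node.children[best],
                   key=lambda rc: rc[1].upper - rc[1].lower)  # widest-gap successor
    if child.upper > child.lower:
        fsss_trial(child, sim, actions, width, horizon)
    node.lower = max(q(a, "lower") for a in actions)          # back up the interval
    node.upper = max(q(a, "upper") for a in actions)
```

Repeated trials from the root progressively tighten the root's interval, and the zero-bound-gap termination property used in the proof sketch above falls out of the widest-gap selection.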
“…Interactive learning has also been posed as an RL problem with an underlying MDP (Sutton and Barto 1998). Approaches for efficient RL in dynamic domains include sample-based planning algorithms (Walsh, Goschin, and Littman 2010), and relational RL (RRL), which uses relational representations and regression for Q-function generalization (Dzeroski, Raedt, and Driessens 2001; Tadepalli, Givan, and Driessens 2004). However, most RRL algorithms focus on planning, limit generalization to a single planning task, or do not support the desired commonsense reasoning capabilities.…”
Section: Related Work
confidence: 99%
“…Andrew et al. [7] introduced a shaping function into reinforcement learning, adding a heuristic value to agents' returns, which effectively improved convergence speed. Asmuth et al. [8] used a potential-field function as prior knowledge to guide the reinforcement learning process and proved the effectiveness of the algorithm.…”
Section: Introduction
confidence: 99%
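
The shaping idea summarized in this statement is commonly written in potential-based form, where a heuristic potential phi supplied as prior knowledge is folded into the reward. A minimal sketch with a toy potential (all names hypothetical):

```python
# Minimal sketch of potential-based reward shaping: the agent learns from
# r + gamma * phi(s') - phi(s) instead of r alone; in this potential-based
# form the shaped problem keeps the same optimal policies. `phi` is a
# hypothetical heuristic potential supplied as prior knowledge.

def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Environment reward augmented with a potential-difference bonus."""
    return r + gamma * phi(s_next) - phi(s)

# Toy usage: a distance-to-goal potential on a 1-D chain (illustrative only).
goal = 10
phi = lambda s: -abs(goal - s)
print(shaped_reward(0.0, 3, 4, phi))   # a step toward the goal earns a positive bonus
```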