2021
DOI: 10.1609/aaai.v35i13.17427

Dynamic Automaton-Guided Reward Shaping for Monte Carlo Tree Search

Abstract: Reinforcement learning and planning have been revolutionized in recent years, due in part to the mass adoption of deep convolutional neural networks and the resurgence of powerful methods to refine decision-making policies. However, the problem of sparse reward signals and their representation remains pervasive in many domains. While various reward-shaping mechanisms and imitation learning approaches have been proposed to mitigate this problem, the use of human-aided artificial rewards introduces human error, su…

Cited by 6 publications (7 citation statements)
References 21 publications
“…Another technique is to shape the reward in proportion to the distance from the accepting node in the automaton (Camacho et al 2018); however, this often leads to suboptimal reward settings. Augmenting the reward function with Monte Carlo Tree Search helps mitigate this issue (Velasquez et al 2021). This approach requires the ability to plan ahead in the environment, which is not always feasible.…”
Section: Related Work (mentioning)
confidence: 99%
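
For context, the distance-based shaping that this statement refers to can be sketched as potential-based shaping over the automaton: the potential of a DFA state is the negated number of transitions to the nearest accepting state, so transitions that move closer to acceptance receive a positive shaping bonus. The following is a minimal Python sketch under our own assumptions; the function names and the dict-based DFA encoding are illustrative, not the API of Camacho et al (2018) or Velasquez et al (2021).

```python
from collections import deque

def distances_to_accepting(transitions, accepting):
    """Breadth-first search backwards from the accepting DFA states.

    transitions: dict mapping (state, symbol) -> next_state
    accepting:   set of accepting states (assumed non-empty)
    Returns a dict state -> fewest transitions needed to reach acceptance.
    """
    # Build reverse adjacency: predecessors of each state.
    preds = {}
    for (q, _sym), q_next in transitions.items():
        preds.setdefault(q_next, set()).add(q)
    dist = {q: 0 for q in accepting}
    frontier = deque(accepting)
    while frontier:
        q = frontier.popleft()
        for p in preds.get(q, ()):
            if p not in dist:
                dist[p] = dist[q] + 1
                frontier.append(p)
    return dist

def shaped_reward(r, q, q_next, dist, gamma=0.99):
    """Potential-based shaping: Phi(q) = -distance(q), r' = r + gamma*Phi(q') - Phi(q)."""
    # States that cannot reach acceptance get the worst potential.
    worst = max(dist.values()) + 1
    phi = lambda s: -dist.get(s, worst)
    return r + gamma * phi(q_next) - phi(q)
```

Because the shaping term is potential-based, it preserves the optimal policy of the underlying task; the suboptimality the statement mentions comes from the fixed distance metric ignoring how hard each transition is to trigger in the environment.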
“…Non-Markovian planning has been studied extensively (Thiébaux et al 2006; Brafman and De Giacomo 2019; Bacchus, Boutilier, and Grove 1997; Thiébaux, Kabanza, and Slaney 2002; Gretton 2006; Velasquez et al 2021). However, it is typically assumed that the goal is given in some automaton representation.…”
Section: Related Work (mentioning)
confidence: 99%
“…As a result, the reward shaping values of the a-transitions would be much higher than those of the b-transitions. In this paper, we extend the dynamic reward shaping approach presented in Velasquez et al (2021) to handle multiple agents and account for both deterministic and stochastic transitions in the underlying NMRDP. It is worth noting that multi-agent reward machines have been considered recently in the context of reinforcement learning (Neary et al 2021).…”
Section: Related Work (mentioning)
confidence: 99%
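
The "dynamic" aspect these statements contrast with fixed distance-based shaping can be illustrated, again under our own assumptions, as a potential that is re-estimated from returns observed during search rather than set in advance. The class below is a hypothetical sketch of that idea; the update rule and defaults are ours, and the cited papers' exact formulation may differ.

```python
class DynamicPotential:
    """Hedged sketch: the potential of an automaton state tracks a
    running average of the returns observed after visiting it during
    tree search, instead of a fixed distance-to-acceptance value."""

    def __init__(self, default=0.0):
        self.value_sum = {}  # automaton state -> sum of observed returns
        self.visits = {}     # automaton state -> visit count
        self.default = default

    def update(self, q, observed_return):
        # Called once per simulated trajectory that passed through q.
        self.value_sum[q] = self.value_sum.get(q, 0.0) + observed_return
        self.visits[q] = self.visits.get(q, 0) + 1

    def phi(self, q):
        n = self.visits.get(q, 0)
        return self.value_sum[q] / n if n else self.default

    def shaped_reward(self, r, q, q_next, gamma=0.99):
        # Standard potential-based shaping with the learned potential.
        return r + gamma * self.phi(q_next) - self.phi(q)
```

Under this kind of scheme, transitions that are easy to trigger but rarely lead to acceptance (the a-transitions in the example above) would see their estimated potential fall as search progresses, rather than retaining a fixed high shaping value.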
“…Rather, we propose a novel dynamic reward shaping function that helps the MATS procedure converge more quickly to higher expected values. Dynamic reward shaping was first introduced in Velasquez et al (2021) for single-agent MCTS. We demonstrate how cooperative and competitive behavior can arise within and across teams by sharing the same search tree in MATS as well as sharing the same DFA objective within the respective teams.…”
Section: Introduction (mentioning)
confidence: 99%