In Reinforcement Learning (RL), an agent is guided by the rewards it receives from the reward function. Unfortunately, it may take many interactions with the environment to learn from sparse rewards, and it can be challenging to specify reward functions that reflect complex reward-worthy behavior. We propose using reward machines (RMs), which are automata-based representations that expose reward function structure, as a normal form representation for reward functions. We show how specifications of reward in various formal languages, including LTL and other regular languages, can be automatically translated into RMs, easing the burden of complex reward function specification. We then show how the exposed structure of the reward function can be exploited by tailored Q-learning algorithms and automated reward shaping techniques to improve the sample efficiency of reinforcement learning methods. Experiments show that these RM-tailored techniques significantly outperform state-of-the-art (deep) RL algorithms, solving problems that otherwise cannot reasonably be solved by existing approaches.
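As a minimal illustration of what such a representation looks like (invented names and a toy task, not the authors' implementation), a reward machine can be encoded as a set of states with event-labeled transitions, each of which emits a reward:

class RewardMachine:
    """Minimal reward machine sketch: RM states, event-labeled transitions, per-transition rewards."""

    def __init__(self, initial_state):
        self.initial_state = initial_state
        self.delta = {}  # (rm_state, event) -> (next_rm_state, reward)

    def add_transition(self, u, event, next_u, reward=0.0):
        self.delta[(u, event)] = (next_u, reward)

    def step(self, u, event):
        # Events with no listed transition self-loop and emit zero reward.
        return self.delta.get((u, event), (u, 0.0))

# Toy example: reward 1 for "get coffee (event 'c'), then reach the office (event 'o')".
rm = RewardMachine(initial_state=0)
rm.add_transition(0, 'c', 1)               # coffee picked up
rm.add_transition(1, 'o', 2, reward=1.0)   # coffee delivered: task complete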
Reinforcement learning (RL) methods usually treat reward functions as black boxes. As such, these methods must extensively interact with the environment in order to discover rewards and optimal policies. In most RL applications, however, users have to program the reward function and, hence, there is the opportunity to make the reward function visible – to show the reward function’s code to the RL agent so it can exploit the function’s internal structure to learn optimal policies in a more sample-efficient manner. In this paper, we show how to accomplish this idea in two steps. First, we propose reward machines, a type of finite state machine that supports the specification of reward functions while exposing reward function structure. We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning. Experiments on tabular and continuous domains, across different tasks and RL agents, show the benefits of exploiting reward structure with respect to sample efficiency and the quality of resultant policies. Finally, by virtue of being a form of finite state machine, reward machines have the expressive power of a regular language and as such support loops, sequences, and conditionals, as well as the expression of temporally extended properties typical of linear temporal logic and non-Markovian reward specification.
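As a sketch of the counterfactual reasoning idea (an illustration under assumptions, not the exact algorithm from this work), a single environment transition can be replayed through every RM state, because the machine reveals the reward each RM state would have produced; this yields one off-policy Q-learning update per RM state from each experience:

from collections import defaultdict

# Q-values indexed by (environment state, RM state, action); rm is assumed to expose
# step(u, event) -> (next_u, reward) as in the sketch above.
Q = defaultdict(float)

def counterfactual_updates(Q, rm, rm_states, actions, s, a, s2, event,
                           alpha=0.1, discount=0.9):
    for u in rm_states:                                    # replay the experience in every RM state
        u2, r = rm.step(u, event)                          # reward this RM state would have emitted
        best_next = max(Q[(s2, u2, b)] for b in actions)
        Q[(s, u, a)] += alpha * (r + discount * best_next - Q[(s, u, a)])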
GBFS-based satisficing planners often augment their search with knowledge-based enhancements such as preferred operators and multiple heuristics. These techniques seek to improve planner performance by making the search more informed. In this work, we focus on how these enhancements impact coverage, and we use a simple technique called epsilon-greedy node selection to demonstrate that planner coverage can also be improved by introducing knowledge-free random exploration into the search. We then revisit the existing knowledge-based enhancements to determine whether the knowledge they employ offers necessary guidance, or whether its impact is to add exploration that can be achieved more simply using randomness. This investigation provides further evidence of the importance of preferred operators and shows that the knowledge added by an additional heuristic is crucial in certain domains, while not being as effective as random exploration in others. Finally, we demonstrate that random exploration can also improve the coverage of LAMA, a planner that already employs multiple enhancements. This suggests that knowledge-based enhancements should be compared against appropriate knowledge-free random baselines to confirm that the knowledge being used is what drives the improvement.
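A minimal sketch of epsilon-greedy node selection (simplified data structures; the parameter value is illustrative): with probability epsilon the next node to expand is drawn uniformly at random from the open list, and otherwise the node with the lowest heuristic value is expanded, as in standard GBFS:

import random

def select_node(open_list, epsilon=0.2):
    # open_list is a list of (h_value, node) pairs; removes and returns the chosen pair.
    if random.random() < epsilon:
        idx = random.randrange(len(open_list))                           # knowledge-free random exploration
    else:
        idx = min(range(len(open_list)), key=lambda i: open_list[i][0])  # greedy on the heuristic
    return open_list.pop(idx)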
Most bounded suboptimal algorithms in the search literature have been developed so as to be epsilon-admissible. This means that the solutions found by these algorithms are guaranteed to cost no more than (1 + ε) times the cost of an optimal solution. However, this is not the only possible form of suboptimality bounding. For example, another possible suboptimality guarantee is additive bounding, which requires that the cost of the solution found is no more than the cost of the optimal solution plus a constant gamma. In this work, we consider the problem of developing algorithms that satisfy a given, and arbitrary, suboptimality requirement. To do so, we develop a theoretical framework which can be used to construct algorithms for a large class of possible suboptimality paradigms. We then use the framework to develop additively bounded algorithms, and show that in practice these new algorithms effectively trade off additive solution suboptimality for runtime.
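One simple way to obtain such a guarantee (a sketch of the general focal-list idea, not the specific algorithms developed in this work) is to restrict expansion to nodes whose f-value is within gamma of the lowest f-value on the open list; with an admissible heuristic that minimum never exceeds the optimal solution cost, so any goal node selected this way costs at most the optimal cost plus gamma:

def select_with_additive_bound(open_list, gamma, tie_breaker):
    # open_list: nodes with attributes g and h (h admissible); tie_breaker: any priority function.
    f_min = min(n.g + n.h for n in open_list)                    # lower bound on the optimal cost
    focal = [n for n in open_list if n.g + n.h <= f_min + gamma]
    chosen = min(focal, key=tie_breaker)                         # any choice within FOCAL keeps the bound
    open_list.remove(chosen)
    return chosen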