Learning reward machines: A study in partially observable reinforcement learning

Icarte, Rodrigo Toro; Klassen, Toryn Q.; Valenzano, Richard; Castro, Margarita P.; Waldie, Ethan; McIlraith, Sheila A.

doi:10.1016/j.artint.2023.103989

Cited by 4 publications

(3 citation statements)

References 36 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…We showed that these are less efficient than our approach by benchmarking against the only available codebase that learnt an automaton from our traces. Meanwhile, Icarte et al [32] use Tabu search, which relies on already knowing a partial model, and Furelos-Blanco et al [12] use an Answer Set Programming algorithm, which assumes a known upper bound for the maximum finite distance between TA states. Our assumptions are weaker than these as we are only required to guess an upper bound on the number of TA states (in Section 4, we explained why this is easy to do) and we do not require any a priori knowledge about the spatial MDP.…”

Section: Related Researchmentioning

confidence: 99%

Learning Task Automata for Reinforcement Learning Using Hidden Markov Models

Abate,

Almulla,

Fox

et al. 2023

Frontiers in Artificial Intelligence and Applications

View full text Add to dashboard Cite

Training reinforcement learning (RL) agents using scalar reward signals is often infeasible when an environment has sparse and non-Markovian rewards. Moreover, handcrafting these reward functions before training is prone to misspecification. We learn non-Markovian finite task specifications as finite-state ‘task automata’ from episodes of agent experience within environments with unknown dynamics. First, we learn a product MDP, a model composed of the specification’s automaton and the environment’s MDP (both initially unknown), by treating it as a partially observable MDP and employing a hidden Markov model learning algorithm. Second, we efficiently distil the task automaton (assumed to be a deterministic finite automaton) from the learnt product MDP. Our automaton enables a task to be decomposed into sub-tasks, so an RL agent can later synthesise an optimal policy more efficiently. It is also an interpretable encoding of high-level task features, so a human can verify that the agent’s learnt tasks have no misspecifications. Finally, we also take steps towards ensuring that the automaton is environment-agnostic, making it well-suited for use in transfer learning.

show abstract

Section: Related Researchmentioning

confidence: 99%

Learning Task Automata for Reinforcement Learning Using Hidden Markov Models

Abate,

Almulla,

Fox

et al. 2023

Frontiers in Artificial Intelligence and Applications

View full text Add to dashboard Cite

show abstract

“…There have been several proposals for the automated generation of RMs. For example, in [19,20] an RM is learned by an agent through experience in the environment. Closer to our work is [4], where an RM for a single agent is generated using LTL and other logics that are equivalent to regular languages, and [9] where a single-agent RM is generated from a sequential or a partial order plan.…”

Section: Related Workmentioning

confidence: 99%

“…Reward machines (RMs) [4,18,19] have recently been proposed as a way of specifying rewards for reinforcement learning (RL) agents. RMs are Mealy machines used to specify tasks and rewards based on a high-level abstraction of the agent's environment.…”

Section: Introductionmentioning

confidence: 99%

Synthesising Reward Machines for Cooperative Multi-Agent Reinforcement Learning

Varricchione,

Alechina,

Dastani

et al. 2023

Lecture Notes in Computer Science

View full text Add to dashboard Cite

Reward machines have recently been proposed as a means of encoding team tasks in cooperative multi-agent reinforcement learning. The resulting multi-agent reward machine is then decomposed into individual reward machines, one for each member of the team, allowing agents to learn in a decentralised manner while still achieving the team task. However, current work assumes the multi-agent reward machine to be given. In this paper, we show how reward machines for team tasks can be synthesised automatically from an Alternating-Time Temporal Logic specification of the desired team behaviour and a high-level abstraction of the agents' environment. We present results suggesting that our automated approach has comparable, if not better, sample efficiency than reward machines generated by hand for multi-agent tasks.

show abstract