2020 59th IEEE Conference on Decision and Control (CDC)
DOI: 10.1109/cdc42340.2020.9304190

Active Task-Inference-Guided Deep Inverse Reinforcement Learning

Cited by 25 publications (18 citation statements)
References 17 publications
“…Finally, we note that different approaches to learn RMs were proposed simultaneously, or shortly after, our original publication (e.g., Xu et al., 2020a,b; Furelos-Blanco et al., 2020; Rens et al., 2020; Gaon and Brafman, 2020; Memarian et al., 2020; Neider et al., 2021; Hasanbeig et al., 2021). They all learn reward machines in fully observable domains.…”
Section: Related Work (citation type: mentioning)
Confidence: 95%
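
For readers unfamiliar with the term, a reward machine (RM) is essentially a Mealy machine over high-level propositions that emits rewards on its transitions. The sketch below is an illustrative data structure under that reading; the class, names, and the coffee/office toy task are hypothetical and not from any cited implementation.

```python
# Illustrative reward machine: a Mealy machine over high-level events
# that emits a reward on each transition. Hypothetical names, not code
# from any of the works cited above.
class RewardMachine:
    def __init__(self, start, transitions):
        # transitions: (rm_state, event) -> (next_rm_state, reward)
        self.state = start
        self.transitions = transitions

    def step(self, event):
        """Advance on one high-level event; return the emitted reward."""
        self.state, reward = self.transitions[(self.state, event)]
        return reward

# Example: reward 1 only after observing "coffee" and then "office".
rm = RewardMachine(0, {(0, "coffee"): (1, 0.0), (0, "office"): (0, 0.0),
                       (1, "coffee"): (1, 0.0), (1, "office"): (2, 1.0)})
print([rm.step(e) for e in ["coffee", "office"]])  # [0.0, 1.0]
```
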
“…Other related work [20] on learning from sparse rewards proposes a method to learn a temporally extended episodic task composed of several subtasks, where the environment returns a sparse reward only at the end of an episode. Using the environment's sparse feedback and queries to a demonstrator, they learn the high-level task structure in the form of a deterministic finite state automaton, and then use the learned task structure in an inverse reinforcement learning (IRL) framework to infer a dense reward function for each subtask.…”
Section: B. Sparse Rewards (citation type: mentioning)
Confidence: 99%
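
To make the two-phase structure described in this statement concrete, here is a minimal sketch under stated assumptions: phase 1 is taken as already having produced a DFA over high-level events, and phase 2 segments a demonstration by DFA state so that each subtask can be given its own dense reward via IRL. The `DFA` class, `segment_by_subtask`, and the toy key/door task are illustrative names, not the authors' implementation.

```python
# Sketch of the two-phase pipeline: (1) a learned DFA encodes the
# high-level task; (2) demonstrations are segmented by DFA state so a
# per-subtask IRL step can fit a dense reward to each segment.
from dataclasses import dataclass, field

@dataclass
class DFA:
    """Deterministic finite automaton over high-level events."""
    start: int
    accepting: set
    delta: dict = field(default_factory=dict)  # (state, event) -> state

    def step(self, state, event):
        return self.delta[(state, event)]

def segment_by_subtask(dfa, demo):
    """Split one demonstration into per-DFA-state segments.

    demo: list of (low_level_transition, high_level_event) pairs.
    Returns {dfa_state: [transitions observed while in that state]}.
    """
    segments, state = {}, dfa.start
    for transition, event in demo:
        segments.setdefault(state, []).append(transition)
        state = dfa.step(state, event)
    return segments

# Example task: pick up a key (subtask 0), then open a door (subtask 1).
dfa = DFA(start=0, accepting={2},
          delta={(0, "none"): 0, (0, "key"): 1,
                 (1, "none"): 1, (1, "door"): 2})
demo = [(("s0", "a0", "s1"), "none"),
        (("s1", "a1", "s2"), "key"),
        (("s2", "a2", "s3"), "door")]
for subtask, transitions in segment_by_subtask(dfa, demo).items():
    # Placeholder for the per-subtask IRL step (e.g., MaxEnt IRL) that
    # would fit a dense reward to these transitions.
    print(f"subtask {subtask}: {len(transitions)} transitions for IRL")
```
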
“…Learning human driver reward functions. Many previous works on learning human drivers' reward functions are extensions of MaxEnt IRL [4] and are restricted to single-agent settings [24]–[26]. Recently, the concept of quantal best response equilibrium (QRE) has been exploited to extend MaxEnt IRL to multi-agent games.…”
Section: Related Work (citation type: mentioning)
Confidence: 99%
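
Since this statement builds on MaxEnt IRL, a minimal single-agent sketch may help: fit a linear reward by matching the policy's expected feature counts to the expert's, using soft value iteration for the policy and forward propagation for state visitations. The tabular setup and all names below are illustrative assumptions, not code from [4] or the citing paper.

```python
# Minimal MaxEnt IRL sketch (feature-matching idea): learn theta so that
# the soft-optimal policy for r = phi @ theta matches the expert's
# feature expectations. Toy tabular setting; all names illustrative.
import numpy as np

def maxent_irl(P, phi, demos, horizon=20, iters=200, lr=0.1):
    """P: (A, S, S) transition tensor; phi: (S, F) state features;
    demos: list of expert state-index sequences."""
    n_actions, n_states, _ = P.shape
    theta = np.zeros(phi.shape[1])
    # Empirical (expert) feature expectations.
    mu_expert = np.mean([phi[traj].sum(0) for traj in demos], axis=0)
    for _ in range(iters):
        r = phi @ theta
        # Soft value iteration -> stochastic policy pi(a|s).
        V = np.zeros(n_states)
        for _ in range(horizon):
            Q = r[None, :] + P @ V                 # (A, S)
            Qmax = Q.max(0)                        # stable log-sum-exp
            V = Qmax + np.log(np.exp(Q - Qmax[None, :]).sum(0))
        pi = np.exp(Q - V[None, :])                # columns sum to 1
        # Expected state visitations under pi, from the demo start states.
        d = np.zeros(n_states)
        for traj in demos:
            d[traj[0]] += 1.0 / len(demos)
        D = d.copy()
        for _ in range(horizon - 1):
            d = np.einsum('as,ast->t', pi * d[None, :], P)
            D += d
        mu_policy = D @ phi
        theta += lr * (mu_expert - mu_policy)      # gradient ascent
    return theta
```

The multi-agent QRE extension mentioned in the quote replaces the single-agent soft-optimal policy with a quantal-response equilibrium of the game, but the feature-matching gradient keeps the same expert-minus-policy form.
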