Imitation Learning as $f$-Divergence Minimization

Ke, Liyiming; Choudhury, Sanjiban; Barnes, Matt; Sun, Wen; Lee, Gilwoo; Srinivasa, Siddhartha S.

doi:10.48550/arxiv.1905.12888

Cited by 28 publications

(40 citation statements)

References 0 publications

“…In ValueDICE the use of KL-divergence introduces the logarithms and exponentials of expectations. We can of course use other divergences, as suggested by [14,8], but we show below that alternatively we can use one statistical measure (not a divergence), which yields a more feasible objective.…”

Section: Theoretical Analysis and SoftDICE (mentioning)

confidence: 99%

SoftDICE for Imitation Learning: Rethinking Off-policy Distribution Matching

Sun, Mahajan, Hofmann et al. 2021

Preprint

We present SoftDICE, which achieves state-of-the-art performance for imitation learning. SoftDICE fixes several key problems in ValueDICE [17], an off-policy distribution matching approach for sample-efficient imitation learning. First, the objective of ValueDICE contains logarithms and exponentials of expectations, for which the mini-batch gradient estimate is always biased. Second, ValueDICE regularizes the objective with replay buffer samples when expert demonstrations are limited in number, which changes the original distribution matching problem. Third, the re-parametrization trick used to derive the off-policy objective relies on an implicit assumption that rarely holds in training. We leverage a novel formulation of distribution matching and consider an entropy-regularized off-policy objective, which yields a completely offline algorithm called SoftDICE. Our empirical results show that SoftDICE recovers the expert policy with only one demonstration trajectory and no further on-policy/off-policy samples. SoftDICE also stably outperforms ValueDICE and other baselines in terms of sample efficiency on MuJoCo benchmark tasks.
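The bias attributed above to logarithms of expectations can be illustrated with a small sketch. It is not the ValueDICE objective: it assumes a toy Exp(1) random variable, for which log E[x] = 0, and all function names are hypothetical. Averaging the log of a mini-batch mean over many batches gives a value below the true one (Jensen's inequality), and the gap shrinks as the batch grows.

```python
import math
import random


def log_of_batch_mean(rng, batch_size):
    """Plug-in estimate of log E[x] from one mini-batch of Exp(1) samples."""
    batch = [rng.expovariate(1.0) for _ in range(batch_size)]
    return math.log(sum(batch) / batch_size)


def average_plugin_estimate(batch_size, num_batches=20000, seed=0):
    """Average the plug-in estimate over many independent mini-batches."""
    rng = random.Random(seed)
    total = sum(log_of_batch_mean(rng, batch_size) for _ in range(num_batches))
    return total / num_batches


# For Exp(1), E[x] = 1, so the quantity being estimated is log(1) = 0.
TRUE_VALUE = math.log(1.0)

small_batch = average_plugin_estimate(batch_size=4)
large_batch = average_plugin_estimate(batch_size=64)

print("true log E[x]:        ", TRUE_VALUE)
print("avg estimate, batch 4: ", small_batch)
print("avg estimate, batch 64:", large_batch)
```

Both averages fall below the true value, which is the sense in which a mini-batch estimate of such an objective is biased even when each batch mean is itself unbiased.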

“…Proposition 1 forms the foundation of a distribution matching approach for imitation learning that learns $\pi$ by minimizing the divergence between the state-action distribution of the policy $d^\pi(s, a)$ and the empirical distribution $d^E(s, a)$ of state-action pairs in the demonstration [12,14,17,8].…”

Section: Introduction (mentioning)

confidence: 99%

SoftDICE for Imitation Learning: Rethinking Off-policy Distribution Matching

Sun, Mahajan, Hofmann et al. 2021

Preprint

“…normal distribution), only RKL-RL would acquire one of the local solutions. Inspired by this fact, even imitation learning, which minimizes forward KL divergence, has been converted into the problem for minimizing reverse KL divergence (Ke et al., 2019; Uchibe and Doya, 2020). However, it is not clear whether the obtained local solution has sufficient performance.…”

Section: Qualitative Differences Between Forward/Reverse KL Divergences (mentioning)

confidence: 99%

“…According to this suggestion, a new optimization method, so-called FKL-RL, is formulated based on forward KL divergence optimization. Note that, only regarding the policy optimization, it is pointed out that traditional RL uses reverse KL divergence, while imitation learning uses forward KL divergence (Ke et al., 2019; Uchibe and Doya, 2020).…”

Section: Introduction (mentioning)

confidence: 99%

Optimistic Reinforcement Learning by Forward Kullback-Leibler Divergence Optimization

Kobayashi 2021

Preprint

This paper addresses a new interpretation of reinforcement learning (RL) as reverse Kullback-Leibler (KL) divergence optimization, and derives a new optimization method using forward KL divergence. Although RL originally aims to maximize return indirectly through optimization of the policy, recent work by Levine has proposed a different derivation process with explicit consideration of optimality as a stochastic variable. This paper follows this concept and formulates the traditional learning laws for both value function and policy as optimization problems with reverse KL divergence including optimality. Focusing on the asymmetry of KL divergence, new optimization problems with forward KL divergence are derived. Remarkably, these new optimization problems can be regarded as optimistic RL. That optimism is intuitively specified by a hyperparameter converted from an uncertainty parameter. In addition, it can be enhanced when it is integrated with prioritized experience replay and eligibility traces, both of which accelerate learning. The effects of this expected optimism were investigated through learning tendencies in numerical simulations using PyBullet. As a result, moderate optimism accelerated learning and yielded higher rewards. In a realistic robotic simulation, the proposed method with moderate optimism outperformed one of the state-of-the-art RL methods.
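The asymmetry of KL divergence mentioned above can be seen in a toy discrete example. This is a sketch under stated assumptions, not taken from any of the cited papers: the four-state "expert" distribution and the two candidate "learner" distributions are made up for illustration, with forward KL taken as KL(expert || learner) and reverse KL as KL(learner || expert).

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical bimodal expert: almost all mass on the first and last state.
expert = [0.49, 0.01, 0.01, 0.49]

# Two hypothetical learners: one spreads mass everywhere, one commits to a mode.
covering = [0.25, 0.25, 0.25, 0.25]
single_mode = [0.97, 0.01, 0.01, 0.01]

for name, learner in [("covering", covering), ("single_mode", single_mode)]:
    forward = kl(expert, learner)  # penalizes missing any expert mode
    reverse = kl(learner, expert)  # penalizes mass where the expert has little
    print(f"{name:12s} forward KL = {forward:.4f}  reverse KL = {reverse:.4f}")
```

In this example forward KL ranks the covering learner better, while reverse KL ranks the single-mode learner better, which matches the mode-covering versus mode-seeking intuition behind the citation statements above.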

“…IRL is an active research area, with fundamental methods being extended in multiple directions. Our work is complementary to many such extensions, including works considering multiple interacting agents [27], or agents with multiple sub-goals [24,25,11], skills [32], or options [17], or using adversarial reward learning to scale to high dimensional problems [13], or using specialised divergence to match a single intent in a multi-intent dataset [18]. Unlike these works, we consider the problem of learning multiple rewards from a dataset containing several unlabeled behaviour intents [3].…”

Section: Related Work (mentioning)

confidence: 99%

LiMIIRL: Lightweight Multiple-Intent Inverse Reinforcement Learning

Snoswell, Singh 2021

Preprint

Multiple-Intent Inverse Reinforcement Learning (MI-IRL) seeks to find a reward function ensemble to rationalize demonstrations of different but unlabelled intents. Within the popular expectation maximization (EM) framework for learning probabilistic MI-IRL models, we present a warm-start strategy based on up-front clustering of the demonstrations in feature space. Our theoretical analysis shows that this warm-start solution produces a near-optimal reward ensemble, provided the behavior modes satisfy mild separation conditions. We also propose a MI-IRL performance metric that generalizes the popular Expected Value Difference measure to directly assess learned rewards against the ground-truth reward ensemble. Our metric elegantly addresses the difficulty of pairing up learned and ground-truth rewards via a min-cost flow formulation, and is efficiently computable. We also develop a MI-IRL benchmark problem that allows for more comprehensive algorithmic evaluations. On this problem, we find our MI-IRL warm-start strategy helps avoid poor-quality local minima reward ensembles, resulting in a significant improvement in behavior clustering. Our extensive sensitivity analysis demonstrates that the quality of the learned reward ensembles is improved under various settings, including cases where our theoretical assumptions do not necessarily hold. Finally, we demonstrate the effectiveness of our methods by discovering distinct driving styles in a large real-world dataset of driver GPS trajectories.
