2021
DOI: 10.48550/arxiv.2106.03155
Preprint

SoftDICE for Imitation Learning: Rethinking Off-policy Distribution Matching

Mingfei Sun,
Anuj Mahajan,
Katja Hofmann
et al.

Abstract: We present SoftDICE, which achieves state-of-the-art performance for imitation learning. SoftDICE fixes several key problems in ValueDICE [17], an off-policy distribution matching approach for sample-efficient imitation learning. Specifically, the objective of ValueDICE contains logarithms and exponentials of expectations, for which the mini-batch gradient estimate is always biased. Second, ValueDICE regularizes the objective with replay buffer samples when expert demonstrations are limited in number, which ho…
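The bias the abstract refers to is a property of plugging batch means inside a logarithm or an exponential: by Jensen's inequality, the log of a mini-batch mean of exp(f) is a downward-biased estimate of log E[exp(f)], so its gradient is biased as well. Below is a minimal NumPy sketch of that effect under assumed synthetic data; the array f is a hypothetical stand-in for per-sample critic values and is not taken from either paper's implementation.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sample values f(s, a); in ValueDICE these would come from a
# learned critic, here they are just Gaussian draws for illustration.
f = rng.normal(loc=0.5, scale=1.0, size=100_000)

# Quantity of interest: log E[exp(f)] over the full dataset.
full_batch_value = np.log(np.mean(np.exp(f)))

# Mini-batch estimator: log of a batch mean of exp(f). Because log is concave,
# Jensen's inequality gives E[log(batch mean)] <= log E[exp(f)], so the
# estimate (and hence its gradient) is systematically too low for small batches.
batch_size = 32
num_batches = len(f) // batch_size
batches = f[: num_batches * batch_size].reshape(num_batches, batch_size)
mini_batch_values = np.log(np.mean(np.exp(batches), axis=1))

print(f"full-batch log E[exp f]   : {full_batch_value:.4f}")
print(f"mean of mini-batch values : {mini_batch_values.mean():.4f}  (downward bias)")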

Cited by 1 publication (2 citation statements)
References 15 publications

“…ValueDICE incorporates distribution correction estimation [25] to remove the on-policy dependency. However, its policy objective contains logarithms and exponentials of expectations, which introduce biases into its gradients [26]. On the other hand, OPOLO adopts an off-policy transition to IL in a principled manner by deriving an upper bound on the IL objective, which removes such biases.…”
Section: Related Work
Mentioning confidence: 99%
“…Although KL-divergence is used in many applications, it is known to have biased gradients [26]. In particular, expectations inside logarithms and exponentials yield biases in mini-batch training.…”
Section: A. Off-policy Learning in Inverse Reinforcement Learning
Mentioning confidence: 99%