2018
DOI: 10.48550/arxiv.1805.10413
Preprint

Fast Policy Learning through Imitation and Reinforcement

Abstract: Imitation learning (IL) consists of a set of tools that leverage expert demonstrations to quickly learn policies. However, if the expert is suboptimal, IL can yield policies with inferior performance compared to reinforcement learning (RL). In this paper, we aim to provide an algorithm that combines the best aspects of RL and IL. We accomplish this by formulating several popular RL and IL algorithms in a common mirror descent framework, showing that these algorithms can be viewed as a variation on a single app…
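As a rough illustration of the "common mirror descent framework" the abstract refers to (the paper's exact formulation may differ), a generic mirror descent update over policies is

$$\pi_{t+1} \;=\; \arg\min_{\pi \in \Pi}\ \eta_t\,\langle g_t, \pi\rangle + D_R(\pi \,\|\, \pi_t),$$

where $g_t$ is the first-order feedback at round $t$ (e.g., a policy-gradient estimate in RL, or the gradient of an imitation loss toward the expert in IL), $\eta_t$ is a step size, and $D_R$ is the Bregman divergence of a strongly convex regularizer $R$ (the KL divergence when $R$ is the negative entropy). Different choices of $g_t$ and $R$ recover different RL and IL updates.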

Cited by 19 publications (23 citation statements). References 26 publications.
“…This can be improved in the future by considering more general function approximators such as Graph Neural Networks [39] to represent the selector policy. Second, while imitation learning of clairvoyant oracles is effective, the approach may be further improved through reinforcement learning [34,35], since in practice we do not use the exact oracle but a sub-optimal approximation, which means that errors in the oracle will transfer to the learner, limiting performance.…”
Section: Discussion
confidence: 99%
“…THOR [34] performs a multi-step search to gain advantage over the reference policy. LOKI [35] switches from IL to RL. Imitation of clairvoyant oracles has been used in multiple domains like information gathering [7], heuristic search [36], and MPC [37,38].…”
Section: Related Work
confidence: 99%
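To make the one-line description above concrete ("LOKI [35] switches from IL to RL"), here is a minimal sketch of that switching scheme. The toy task, suboptimal expert, linear-Gaussian policy, and hyperparameters below are hypothetical stand-ins, not the authors' implementation: the policy is first fit by imitation for a small, randomly chosen number of iterations, then improved with a policy-gradient (REINFORCE) phase on the true reward.

```python
# Minimal sketch (assumptions, not the authors' code) of the "switch from IL
# to RL" idea: warm-start a policy by imitating a suboptimal expert for a
# small random number of iterations, then continue with policy-gradient RL.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: state s ~ N(0, 1); reward peaks when action = 2 * s.
def sample_state():
    return rng.normal()

def reward(s, a):
    return -(a - 2.0 * s) ** 2

def expert_action(s):
    # Suboptimal expert: right direction but wrong gain (1.5 instead of 2.0).
    return 1.5 * s

theta, sigma = 0.0, 0.5  # linear-Gaussian policy: a ~ N(theta * s, sigma^2)

# Phase 1: imitation learning -- squared-error regression onto the expert's
# actions, run for a small, randomly chosen number of iterations.
n_il = int(rng.integers(10, 50))
for _ in range(n_il):
    s = sample_state()
    theta += 0.05 * (expert_action(s) - theta * s) * s  # step on (a* - theta*s)^2

# Phase 2: reinforcement learning -- REINFORCE on the true reward, which lets
# the learner improve past the suboptimal expert it imitated.
for _ in range(5000):
    s = sample_state()
    a = theta * s + sigma * rng.normal()
    # Likelihood-ratio gradient: reward * d/d(theta) log N(a; theta*s, sigma^2).
    grad = reward(s, a) * (a - theta * s) * s / sigma ** 2
    theta += 0.01 * grad

print(f"IL iterations: {n_il}, learned gain: {theta:.2f} (expert 1.5, optimal 2.0)")
```

In this sketch the imitation phase moves the policy quickly toward the expert's behavior, and the RL phase then pushes the gain past 1.5 toward the optimal 2.0, mirroring the motivation of combining IL's fast start with RL's ability to exceed a suboptimal expert.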
“…Many approaches investigate the incorporation of human-provided demonstrations into policy search to drastically reduce sample complexity via a reasonable initial policy and/or the integration of demonstrations in the learning objective [32,18,6,33,42,41,23].…”
Section: Policy Search
confidence: 99%
“…Supervised approaches for policy learning like Learning From Demonstration (LfD) [2] can encode human prior knowledge by imitating expert examples, but do not support optimization in new environments. Combining RL with LfD is a powerful method for reducing the sample complexity of policy search, and is often used in practice [23,33,42,6]. However, …”
Section: Introduction
confidence: 99%
“…Batch reinforcement learning in both the tabular and function-approximator settings has long been studied (Lange et al., 2012; Strehl et al., 2010) and continues to be a highly active area of research (Swaminathan & Joachims, 2015; Jiang & Li, 2015; Thomas & Brunskill, 2016; Farajtabar et al., 2018; Irpan et al., 2019; Jaques et al., 2019). Imitation learning is also a well-studied problem (Schaal, 1999; Argall et al., 2009; Hussein et al., 2017) that likewise remains a highly active area of research (Kim et al., 2013; Piot et al., 2014; Chemali & Lazaric, 2015; Hester et al., 2018; Ho et al., 2016; Sun et al., 2017; Cheng et al., 2018; Gao et al., 2018). This paper relates most closely to Fujimoto et al. (2018a), which made the critical observation that when conventional DQL-based algorithms are employed for batch reinforcement learning, performance can be very poor, with the algorithm possibly not learning at all.…”
Section: Related Work
confidence: 99%