2019
DOI: 10.48550/arxiv.1910.12179
Preprint

BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning

Abstract: The field of Deep Reinforcement Learning (DRL) has recently seen a surge in research in batch reinforcement learning, which aims for sample-efficient learning from a given data set without additional interactions with the environment. In the batch DRL setting, commonly employed off-policy DRL algorithms can perform poorly and sometimes even fail to learn altogether. In this paper we propose a new algorithm, Best-Action Imitation Learning (BAIL), which unlike many off-policy DRL algorithms does not involve maxim…

Cited by 9 publications (11 citation statements)
References 17 publications
“…Similarly, Conservative Q-Learning (CQL) (Kumar et al, 2020) aims to learn a conservative Q-function so that its values lower bound the true performance of the corresponding policy. Best Action Imitation Learning (BAIL) (Chen et al, 2019) is one of the very few algorithms training a state-value function to find good actions to imitate with a policy network. In Critic Regularized Regression (CRR) (Wang et al, 2020), taking actions outside the training distribution is discouraged by the use of filtered policy gradients: the new policy is trained to imitate behavior seen in the initial dataset; however, the behavior is filtered by the Q-function, so that promising actions are more likely to be imitated than others.…”
Section: Related Work
Mentioning confidence: 99%
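The conservative lower-bounding idea behind CQL mentioned in this excerpt can be made concrete with a small sketch. The snippet below shows the usual CQL-style regularizer for a discrete action space: a log-sum-exp over all action values is pushed down while the value of the dataset action is pushed up. The network architecture, the weight alpha, and the discrete-action setting are assumptions for illustration, not the cited paper's reference implementation.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Q-network over a discrete action space: outputs Q(s, .) for every action."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_actions))
    def forward(self, obs):
        return self.net(obs)

def cql_penalty(qnet, obs, act, alpha=1.0):
    """Conservative regularizer: push down the log-sum-exp of all Q-values,
    push up the Q-value of the action actually observed in the dataset."""
    q_all = qnet(obs)                                      # [batch, n_actions]
    q_data = q_all.gather(1, act.unsqueeze(1)).squeeze(1)  # Q(s, a_data)
    return alpha * (torch.logsumexp(q_all, dim=1) - q_data).mean()

# Toy usage: in practice this penalty is added to a standard TD loss.
qnet = QNet(obs_dim=4, n_actions=3)
obs = torch.randn(8, 4)
act = torch.randint(0, 3, (8,))
print(cql_penalty(qnet, obs, act))
```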
“…AWR [37] is a very similar method that focuses on continuous control tasks. BAIL [6] learns an upper-envelope value function that encourages optimistic value estimates, and therefore conservative advantage estimates. Samples are filtered based on heuristics of their advantage relative to the rest of the dataset (e.g.…”
Section: Binary
Mentioning confidence: 99%
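The upper-envelope construction described above can be sketched as follows: a state-value network is regressed onto Monte Carlo returns with an asymmetric penalty so that it tends to sit above them, and only the state-action pairs whose return comes close to the envelope are kept for imitation. The penalty weight K, the ratio-based selection rule, and the keep fraction below are illustrative assumptions rather than the exact heuristics used in BAIL.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Small state-value network V(s)."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))
    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def upper_envelope_loss(V, obs, returns, K=1000.0):
    """Asymmetric squared error: V(s) undershooting the return G costs K times more,
    so the fitted value function tends to form an upper envelope of the returns."""
    diff = V(obs) - returns
    return torch.where(diff >= 0, diff ** 2, K * diff ** 2).mean()

def select_best_pairs(V, obs, returns, keep_frac=0.25):
    """Keep indices of (s, a) pairs whose return comes closest to the envelope."""
    with torch.no_grad():
        ratio = returns / V(obs).clamp(min=1e-6)   # assumes positive returns
    k = max(1, int(keep_frac * len(returns)))
    return torch.topk(ratio, k).indices

# Toy usage: fit the envelope with gradient steps on upper_envelope_loss,
# then behavior-clone only the pairs returned by select_best_pairs.
```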
“…Our goal is to build agents that can learn to pick out valuable information in large, unfiltered datasets containing experience from a mixture of sub-optimal policies. Recent work has proposed new ways to mimic the best demonstrations in a dataset [33,46,45,37,6]. In this paper, we analyze and address some of the problems that arise in determining which demonstrations are worth learning.…”
Section: Introduction
Mentioning confidence: 99%
“…This is similar in spirit to the motivation provided in Figure 1. Examples of this approach include monotonic advantage re-weighted imitation learning (MARWIL) [37], best-action imitation learning (BAIL) [7], advantage-weighted behavior models (ABM) [30] and advantage weighted regression [27], which has previously been studied in the form of a Fitted Q-iteration algorithm with low-dimensional policy classes [26]. Our approach differs from most of these in that we do not rely on observed returns for advantage estimation, as elaborated on in Sec.…”
Section: Related Work
Mentioning confidence: 99%
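The family of methods listed in this excerpt shares a common pattern: dataset actions are imitated with weights that grow with their estimated advantage. Below is a minimal sketch of the return-based variant (the one the quoted authors deliberately move away from); the temperature beta, the clipping constant, and the use of G − V(s) as the advantage estimate are assumptions for illustration.

```python
import numpy as np

def advantage_weights(returns, baseline_values, beta=1.0, clip=20.0):
    """w_i = exp(A_i / beta) with A_i = G_i - V(s_i); clipped for numerical stability."""
    adv = returns - baseline_values
    return np.exp(np.clip(adv / beta, -clip, clip))

# Usage: scale each sample's behavior-cloning loss by these weights so that
# high-advantage demonstrations dominate the imitation objective.
returns = np.array([10.0, 2.0, 5.0])
values = np.array([4.0, 4.0, 4.0])
print(advantage_weights(returns, values, beta=2.0))
```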
“…If Q(st, at) ≥ Q(st, π(st)), the corresponding action is marked green and the pair (st, at) is used to train the policy. If Q(st, at) < Q(st, π(st)), the action is marked red and is not used to train the policy.…”
Section: Introduction
Mentioning confidence: 99%
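The green/red filtering rule quoted above is straightforward to express directly. The sketch below computes a boolean mask over a batch of dataset transitions; the stand-in critic and policy callables are purely illustrative assumptions and are not taken from the cited paper.

```python
import torch

def q_filter_mask(critic, policy, obs, act):
    """Boolean mask: True ("green") where Q(s, a) >= Q(s, pi(s)),
    i.e. the dataset action is at least as good as the policy's own proposal."""
    with torch.no_grad():
        return critic(obs, act) >= critic(obs, policy(obs))

# Toy usage with stand-in callables (illustrative only):
critic = lambda s, a: -((a - 0.5 * s) ** 2).sum(dim=-1)  # best action is a = 0.5 * s
policy = lambda s: torch.zeros_like(s)                    # current policy proposes a = 0
obs, act = torch.randn(4, 2), torch.randn(4, 2)
mask = q_filter_mask(critic, policy, obs, act)            # "green" pairs to imitate
print(mask)
```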