2019
DOI: 10.48550/arxiv.1910.12179
Preprint

BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning

Abstract: The field of Deep Reinforcement Learning (DRL) has recently seen a surge in research in batch reinforcement learning, which aims for sample-efficient learning from a given data set without additional interactions with the environment. In the batch DRL setting, commonly employed off-policy DRL algorithms can perform poorly and sometimes even fail to learn altogether. In this paper we propose a new algorithm, Best-Action Imitation Learning (BAIL), which unlike many off-policy DRL algorithms does not involve maxim…

Cited by 9 publications (11 citation statements)
References 17 publications
“…Similarly, Conservative Q-Learning (CQL) (Kumar et al, 2020) aims to learn a conservative Q-function so that its values lower bound the true performance of the corresponding policy. Best Action Imitation Learning (BAIL) (Chen et al, 2019) is one of the very few algorithms training a state-value function to find good actions to imitate with a policy network. In Critic Regularized Regression (CRR) (Wang et al, 2020), taking actions outside the training distribution is discouraged by the use of filtered policy gradients: the new policy is trained to imitate behavior seen in the initial dataset; however, the behavior is filtered by the Q-function, so that promising actions are more likely to be imitated than others.…”
Section: Related Work
Mentioning confidence: 99%
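The conservative lower-bounding idea behind CQL mentioned in this excerpt can be made concrete with a small sketch. The snippet below shows the usual CQL-style regularizer for a discrete action space: a log-sum-exp over all action values is pushed down while the value of the dataset action is pushed up. The network architecture, the weight alpha, and the discrete-action setting are assumptions for illustration, not the cited paper's reference implementation.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Q-network over a discrete action space: outputs Q(s, .) for every action."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_actions))
    def forward(self, obs):
        return self.net(obs)

def cql_penalty(qnet, obs, act, alpha=1.0):
    """Conservative regularizer: push down the log-sum-exp of all Q-values,
    push up the Q-value of the action actually observed in the dataset."""
    q_all = qnet(obs)                                      # [batch, n_actions]
    q_data = q_all.gather(1, act.unsqueeze(1)).squeeze(1)  # Q(s, a_data)
    return alpha * (torch.logsumexp(q_all, dim=1) - q_data).mean()

# Toy usage: in practice this penalty is added to a standard TD loss.
qnet = QNet(obs_dim=4, n_actions=3)
obs = torch.randn(8, 4)
act = torch.randint(0, 3, (8,))
print(cql_penalty(qnet, obs, act))
```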
“…AWR [37] is a very similar method that focuses on continuous control tasks. BAIL [6] learns an upper-envelope value function that encourages optimistic value estimates, and therefore conservative advantage estimates. Samples are filtered based on heuristics of their advantage relative to the rest of the dataset (e.g.…”
Section: Binary
Mentioning confidence: 99%
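The upper-envelope construction described above can be sketched as follows: a state-value network is regressed onto Monte Carlo returns with an asymmetric penalty so that it tends to sit above them, and only the state-action pairs whose return comes close to the envelope are kept for imitation. The penalty weight K, the ratio-based selection rule, and the keep fraction below are illustrative assumptions rather than the exact heuristics used in BAIL.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Small state-value network V(s)."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))
    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def upper_envelope_loss(V, obs, returns, K=1000.0):
    """Asymmetric squared error: V(s) undershooting the return G costs K times more,
    so the fitted value function tends to form an upper envelope of the returns."""
    diff = V(obs) - returns
    return torch.where(diff >= 0, diff ** 2, K * diff ** 2).mean()

def select_best_pairs(V, obs, returns, keep_frac=0.25):
    """Keep indices of (s, a) pairs whose return comes closest to the envelope."""
    with torch.no_grad():
        ratio = returns / V(obs).clamp(min=1e-6)   # assumes positive returns
    k = max(1, int(keep_frac * len(returns)))
    return torch.topk(ratio, k).indices

# Toy usage: fit the envelope with gradient steps on upper_envelope_loss,
# then behavior-clone only the pairs returned by select_best_pairs.
```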
“…Our goal is to build agents that can learn to pick out valuable information in large, unfiltered datasets containing experience from a mixture of sub-optimal policies. Recent work has proposed new ways to mimic the best demonstrations in a dataset [33,46,45,37,6]. In this paper, we analyze and address some of the problems that arise in determining which demonstrations are worth learning.…”
Section: Introduction
Mentioning confidence: 99%
“…This is similar in spirit to the motivation provided in Figure 1. Examples of this approach include monotonic advantage re-weighted imitation learning (MARWIL) [37], best-action imitation learning (BAIL) [7], advantage-weighted behavior models (ABM) [30] and advantage weighted regression [27], which has previously been studied in the form of a Fitted Q-iteration algorithm with low-dimensional policy classes [26]. Our approach differs from most of these in that we do not rely on observed returns for advantage estimation, as elaborated on in Sec.…”
Section: Related Work
Mentioning confidence: 99%
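The family of methods listed in this excerpt shares a common pattern: dataset actions are imitated with weights that grow with their estimated advantage. Below is a minimal sketch of the return-based variant (the one the quoted authors deliberately move away from); the temperature beta, the clipping constant, and the use of G − V(s) as the advantage estimate are assumptions for illustration.

```python
import numpy as np

def advantage_weights(returns, baseline_values, beta=1.0, clip=20.0):
    """w_i = exp(A_i / beta) with A_i = G_i - V(s_i); clipped for numerical stability."""
    adv = returns - baseline_values
    return np.exp(np.clip(adv / beta, -clip, clip))

# Usage: scale each sample's behavior-cloning loss by these weights so that
# high-advantage demonstrations dominate the imitation objective.
returns = np.array([10.0, 2.0, 5.0])
values = np.array([4.0, 4.0, 4.0])
print(advantage_weights(returns, values, beta=2.0))
```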
“…If Q(st, at) ≥ Q(st, π(st)), the corresponding action is marked green and the pair (st, at) is used to train the policy. If Q(st, at) < Q(st, π(st)), the action is marked red and is not used to train the policy.…”
Section: Introduction
Mentioning confidence: 99%
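The green/red filtering rule quoted above is straightforward to express directly. The sketch below computes a boolean mask over a batch of dataset transitions; the stand-in critic and policy callables are purely illustrative assumptions and are not taken from the cited paper.

```python
import torch

def q_filter_mask(critic, policy, obs, act):
    """Boolean mask: True ("green") where Q(s, a) >= Q(s, pi(s)),
    i.e. the dataset action is at least as good as the policy's own proposal."""
    with torch.no_grad():
        return critic(obs, act) >= critic(obs, policy(obs))

# Toy usage with stand-in callables (illustrative only):
critic = lambda s, a: -((a - 0.5 * s) ** 2).sum(dim=-1)  # best action is a = 0.5 * s
policy = lambda s: torch.zeros_like(s)                    # current policy proposes a = 0
obs, act = torch.randn(4, 2), torch.randn(4, 2)
mask = q_filter_mask(critic, policy, obs, act)            # "green" pairs to imitate
print(mask)
```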