Offline Reinforcement Learning as Anti-exploration

Rezaeifar, Shideh; Dadashi, Robert; Vieillard, Nino; Hussenot, Léonard; Bachem, Olivier; Pietquin, Olivier; Geist, Matthieu

doi:10.1609/aaai.v36i7.20783

Cited by 12 publications

(12 citation statements)

References 37 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…We describe a theorem that answers the first question: soft policy iteration [17] with a penalized value function Q is equivalent to policy iteration regularized by KL π(s) π p (s) where π p (a|s) = softmax − p(s, a) . This theorem is a generalized version of the theorem shown in [39], which does not require unnecessary assumptions on the penalty function.…”

Section: Theoretic Background On Direct Q-penalizationmentioning

confidence: 94%

“…One straightforward solution to the over-estimation problem is directly penalizing Q estimation [39,6] with a penalty function p(s, a):…”

Section: Methodsmentioning

confidence: 99%

“…Therefore, we want to design a penalty function that resembles the oracle error. Since the estimation error is likely to occur more often for out-ofdistribution state-action pairs (s, a ) / ∈ D, a few ad-hoc methods have been proposed for the purpose of measuring the epistemic uncertainty of Q θ ; the aleatoric uncertainty of the transition dynamics model is suggested as a proxy for the epistemic uncertainty of Q θ [52], and generative models [39] or pseudometrics [6] are proposed with the purpose of distinguishing whether a particular (s, a) is in-distribution or not. However, it has not been thoroughly investigated how penalty functions affect the policy iteration process, nor what penalty functions are best for offline RL.…”

Section: Methodsmentioning

confidence: 99%

“…Another direction of research tackles offline RL problems via value learning, trying to resolve the distribution shift problem that arises in the offline setup. Since distribution shift commonly results in overestimation of values, the algorithms belonging to this category try to estimate values pessimistically for out-of-distribution inputs [29,14], sometimes by explicitly quantifying the certainty with a trained transition dynamics model [52,21] or a generative model [39,6]. Intuitively, value-based offline RL methods are preferred over the behavior cloning-based counterpart since the value function could tell both what should and should not be done while imitation learning only utilizes the half side of the information.…”

Section: Related Workmentioning

confidence: 99%

“…However, their use of MMD distance for the constraint is only empirically justified with certain conditions, such as the small number of samples in computing the distance [25]. Another common problem that arises in other penalty functions is their use of proxy and their formulation; for example, a conditional variational autoencoder (CVAE) [39], a pseudometric [6], or a transition dynamics model [52] are estimated instead of the behavior policy β, and penalty functions are designed heuristically with the proxy estimates. While such formulations could show some positive correlation to the suggested penalty function, there is no clear connection that allows us to interpret the penalty in terms of β or the support set.…”

Section: What Makes a Good Penalty Function?mentioning

confidence: 99%

See 4 more Smart Citations

Know Your Boundaries: The Necessity of Explicit Behavioral Cloning in Offline RL

Goo¹,

Niekum²

2022

Preprint

View full text Add to dashboard Cite

We introduce an offline reinforcement learning (RL) algorithm that explicitly clones a behavior policy to constrain value learning. In offline RL, it is often important to prevent a policy from selecting unobserved actions, since the consequence of these actions cannot be presumed without additional information about the environment. One straightforward way to implement such a constraint is to explicitly model a given data distribution via behavior cloning and directly force a policy not to select uncertain actions. However, many offline RL methods instantiate the constraint indirectly-for example, pessimistic value estimation-due to a concern about errors when modeling a potentially complex behavior policy. In this work, we argue that it is not only viable but beneficial to explicitly model the behavior policy for offline RL because the constraint can be realized in a stable way with the trained model. We first suggest a theoretical framework that allows us to incorporate behavior-cloned models into value-based offline RL methods, enjoying the strength of both explicit behavior cloning and value learning. Then, we propose a practical method utilizing a score-based generative model for behavior cloning. With the proposed method, we show state-of-the-art performance on several datasets within the D4RL and Robomimic benchmarks and achieve competitive performance across all datasets tested. IntroductionThe goal of offline reinforcement learning (RL) is to learn a policy purely from pre-generated data. This data-driven RL paradigm is promising since it opens up a possibility for RL to be widely applied to many realistic scenarios where large-scale data is available. Two primary targets need to be considered in designing offline RL algorithms: maximizing reward and staying close to the provided data. Finding a policy that maximizes the accumulated sum of rewards is the main objective in RL, and this can be achieved via learning an optimal Q-value function. However, in the offline setup, it is often infeasible to infer a precise optimal Q-value function due to limited data coverage [32,34]; for example, the value of states not shown in the dataset cannot be estimated without additional assumptions about the environment. This implies that value learning can typically be performed accurately only for the subset of the state (or state-action) space covered by a dataset. Because of this limitation, some form of imitation learning objectives that can force a policy to stay close to the given data warrants consideration in offline RL.Recently, many offline RL algorithms have been proposed that instantiate an imitation learning objective without explicitly modeling the data distribution of the provided dataset. For instance, one approach applies the pessimism under uncertainty principle in value learning [4,29,23] in order to prevent out-of-distribution actions from being selected. While these methods show promising practical results for certain domains, it has also been reported that such methods fall short compared Preprint...

show abstract

Section: Theoretic Background On Direct Q-penalizationmentioning

confidence: 94%

“…One straightforward solution to the over-estimation problem is directly penalizing Q estimation [39,6] with a penalty function p(s, a):…”

Section: Methodsmentioning

confidence: 99%