2022
DOI: 10.48550/arxiv.2201.05433
Preprint

Comparing Model-free and Model-based Algorithms for Offline Reinforcement Learning

Abstract: Offline reinforcement learning (RL) algorithms are often designed with environments such as MuJoCo in mind, in which the planning horizon is extremely long and no noise exists. We compare model-free, model-based, as well as hybrid offline RL approaches on various industrial benchmark (IB) datasets to test the algorithms in settings closer to real-world problems, including complex noise and partially observable states. We find that on the IB, hybrid approaches face severe difficulties and that simpler algorithm…


Cited by 2 publications (3 citation statements)
References 31 publications (33 reference statements)
“…The goal is to learn a policy that can maximize $\mathbb{E}_{(s,a)\sim\rho_T^{\pi}}[r(s,a) - u(s,a)]$. Existing uncertainty computations [32,39] only calculate the deviation during policy optimization without evaluating OOD generalization. Therefore, we propose an energy function to evaluate the exploration behavior through reward shaping.…”
Section: Energy-based OOD Generalization Evaluation
Confidence: 99%
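The objective quoted above amounts to subtracting an uncertainty estimate u(s, a) from the reward r(s, a) before policy optimization. A minimal sketch of this kind of uncertainty-penalized reward shaping follows, assuming the uncertainty is taken to be ensemble disagreement; the function and model names are hypothetical and this is not the cited paper's energy-based estimator:

```python
import numpy as np

def penalized_reward(reward, state_action, reward_models, penalty_weight=1.0):
    """Sketch of r(s, a) - u(s, a): u(s, a) is approximated here by the
    disagreement (standard deviation) of an ensemble of learned models.
    `reward_models` is assumed to be a list of callables mapping a
    state-action feature vector to a scalar prediction."""
    predictions = np.array([m(state_action) for m in reward_models])
    uncertainty = predictions.std(axis=0)          # u(s, a)
    return reward - penalty_weight * uncertainty   # r(s, a) - u(s, a)

# Toy usage with two disagreeing stand-in models (purely illustrative):
models = [lambda sa: sa.sum(), lambda sa: sa.sum() + 0.5]
print(penalized_reward(reward=1.0,
                       state_action=np.array([0.1, 0.2]),
                       reward_models=models))
```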
“…The uncertainty of the current policy reduces the interference of extrapolation errors. However, existing uncertainty factors limit the behavior to offline datasets by estimating the model discrepancies that might overfit the limited and suboptimal offline datasets [32,35]. The agent is limited to the behavior policy of offline datasets and can not achieve tasks in OOD regions.…”
Section: Introduction
Confidence: 99%
“…Figure 4: Evaluation performance and distance to the original policy of the LION approach over the chosen λ hyperparameter. Various state of the art baselines are added as dashed lines with their standard set of hyperparameters (results from (Swazinna et al, 2022)). Even though the baselines all exhibit some hyperparameter that controls the distance to the original policy, all are implemented differently and we can neither map them to a corresponding lambda value of our algorithm, nor change the behavior at runtime, which is why we display them as dashed lines over the entire λ spectrum.…”
Section: Industrial Benchmark
Confidence: 99%
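The excerpt describes sweeping a single hyperparameter λ that trades off proximity to the original (behavior) policy against return. As a hedged illustration of that trade-off only, and not the LION implementation itself, the toy sketch below conditions the executed action on λ; all names are hypothetical:

```python
import numpy as np

def lambda_tradeoff_action(learned_action, behavior_action, lam):
    """Illustrative only: interpolate between the dataset's behavior policy
    (lam = 0) and the freely optimized policy (lam = 1). LION trains a single
    lambda-conditioned policy; this interpolation merely shows how one knob
    can control the distance to the original policy at runtime."""
    lam = float(np.clip(lam, 0.0, 1.0))
    return (1.0 - lam) * behavior_action + lam * learned_action

# Sweeping lambda mimics the kind of evaluation curve described above:
for lam in np.linspace(0.0, 1.0, 5):
    a = lambda_tradeoff_action(np.array([0.8]), np.array([0.2]), lam)
    print(f"lambda={lam:.2f} -> action={a}")
```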