2021
DOI: 10.48550/arXiv.2103.09575
Preprint

Regularized Behavior Value Estimation

Abstract: Offline reinforcement learning restricts the learning process to rely only on logged data, without access to an environment. While this enables real-world applications, it also poses unique challenges. One important challenge is dealing with errors caused by the over-estimation of values for state-action pairs not well covered by the training data. Due to bootstrapping, these errors get amplified during training and can lead to divergence, thereby crippling learning. To overcome this challenge, we introduce Reg…
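
To make the over-estimation mechanism described in the abstract concrete, here is a minimal tabular sketch (an illustration of the failure mode, not code from the paper): the bootstrapped target takes a max over all actions, including one the logged data never contains, so an optimistic estimate for that unseen action inflates the values of well-covered pairs and is never corrected.

```python
import numpy as np

# Toy illustration (not from the paper): a 2-state, 2-action MDP whose logged
# dataset only ever contains action 0, with every reward equal to 0. Action 1
# is never observed, yet the bootstrapped target max_a' Q(s', a') still
# consults its (erroneously optimistic) estimate.
n_states, n_actions, gamma, lr = 2, 2, 0.99, 0.5

Q = np.zeros((n_states, n_actions))
Q[:, 1] = 5.0  # optimistic initial estimate for the action the data never covers

# Logged transitions (s, a, r, s'): the behavior policy always takes action 0.
dataset = [(0, 0, 0.0, 1), (1, 0, 0.0, 0)]

for step in range(200):
    s, a, r, s_next = dataset[step % len(dataset)]
    # Off-policy target: the max leaks value from the unseen action 1 into the
    # well-covered pair (s, 0), even though no logged reward is ever positive.
    target = r + gamma * Q[s_next].max()
    Q[s, a] += lr * (target - Q[s, a])

# Q(s, 0) converges to ~gamma * 5 = 4.95 although the true behavior value is 0;
# the optimistic entries for action 1 are never corrected because they never
# appear in the data.
print(Q)
```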

Cited by 4 publications (4 citation statements) · References 16 publications
“…To alleviate the severe extrapolation error in offline agent learning, we plug the filtering mechanism of CRR (Wang et al. 2020c) into individual policy learning. This method can implicitly constrain the forward KL divergence between the learning policy and the behavior policy, which is widely used in offline single-agent learning (Wang et al. 2020c; Nair et al. 2020; Gulcehre et al. 2021) and multi-agent learning (Yang et al. 2021). Formally, suppose that the type of agent i is J, and its corresponding DPER is denoted as B_J.…”
Section: Conservative Policy Learning With GAT-based Critic
confidence: 99%
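
The CRR-style binary filter this passage refers to can be sketched as follows; `pi_net` and `q_net` are hypothetical discrete-action networks, and this is an illustrative reading of the mechanism rather than the cited works' actual code. Dataset actions are cloned only when the critic judges them at least as good as the current policy's average action, so probability mass can only move toward actions the behavior policy actually took.

```python
import torch
import torch.nn.functional as F

def crr_binary_filter_loss(pi_net, q_net, states, actions):
    """Filtered behavior-cloning loss in the style of CRR (sketch only).

    states:  (B, state_dim) float tensor of logged states
    actions: (B,) long tensor of logged discrete actions
    pi_net / q_net: modules mapping states to (B, n_actions) logits / Q-values
    """
    logits = pi_net(states)                      # (B, n_actions)
    q_values = q_net(states)                     # (B, n_actions) critic estimates
    probs = F.softmax(logits, dim=-1).detach()

    # Advantage of the logged action relative to the policy's expected value.
    v = (probs * q_values).sum(dim=-1)
    q_a = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    advantage = q_a - v

    # Binary filter: only imitate dataset actions the critic does not
    # consider worse than the current policy's average.
    weight = (advantage >= 0).float().detach()

    log_pi_a = F.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(weight * log_pi_a).mean()
```

Because the objective is a (filtered) log-likelihood of logged actions, gradient steps can only move probability mass toward actions present in the data, which is the implicit forward-KL constraint on the learned policy that the passage mentions.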
“…State-action pairs outside of the dataset are therefore never actually experienced and can receive erroneously optimistic value estimates because of extrapolation errors that are not corrected by environment feedback. The overestimation of value functions can be encouraged in online reinforcement learning, to a certain extent, as it incentivizes agents to explore and learn by trial and error (Gulcehre et al. 2021; Schmidhuber 1991). Moreover, in an online setting, if the agent wrongly assigns a high value to a given action, this action will be chosen, the real return will be experienced, and the value of the action will be corrected through bootstrapping.…”
Section: Framing Offline RL As Anti-exploration
confidence: 99%
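
The self-correction this passage describes can be seen by rerunning the earlier toy example with environment access (again a hypothetical illustration): the greedy agent actually executes the over-valued action, observes the real reward of 0, and the bootstrapped update pulls the inflated estimate back down, something that cannot happen when learning only from the fixed log.

```python
import numpy as np

# Same hypothetical 2-state MDP as above, but now the agent can interact with
# it: every reward is 0 and any action moves the agent to the other state.
gamma, lr = 0.99, 0.5
Q = np.zeros((2, 2))
Q[:, 1] = 5.0  # the same erroneously optimistic estimates for action 1

def env_step(s, a):
    return 0.0, 1 - s  # real reward, next state

s = 0
for _ in range(2000):
    a = int(Q[s].argmax())               # the over-valued action gets picked...
    r, s_next = env_step(s, a)           # ...its real (zero) reward is observed...
    target = r + gamma * Q[s_next].max()
    Q[s, a] += lr * (target - Q[s, a])   # ...and bootstrapping corrects the estimate.
    s = s_next

# Unlike in the offline-only update above, the optimistic entries are exactly
# the ones that get visited, so they decay toward the true value of 0.
print(Q)
```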
“…Some algorithms take a different approach, understanding and solving offline RL problems from the perspective of on-policy learning. R-BVE (Gulcehre et al., 2021) and Onestep RL (Brandfonbrener et al., 2021) both transform off-policy style offline algorithms (such as CRR (Wang et al., 2020), BCQ (Fujimoto et al., 2019), BRAC (Wu et al., 2019)) into on-policy style. Besides, BPPO (Zhuang et al., 2023) finds that the online algorithm PPO (Schulman et al., 2017) can directly solve offline RL due to its inherent conservatism.…”
Section: Related Work
confidence: 99%
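
A minimal sketch of the on-policy-style evaluation idea these one-step methods share, under the same toy assumptions as above (this is an illustration, not R-BVE's or Onestep RL's actual algorithm): the value target bootstraps from the next action actually logged in the dataset, SARSA-style, so actions outside the data never enter the target, and a single policy-improvement step (for instance the filtered cloning loss sketched earlier) can then be taken against that estimate.

```python
import numpy as np

# Behavior value estimation on the same hypothetical log: the target bootstraps
# from the *next logged action* instead of max_a', so actions never observed in
# the data cannot leak optimism into the estimate.
gamma, lr = 0.99, 0.5
Q = np.zeros((2, 2))
Q[:, 1] = 5.0  # same optimistic initialization for the unseen action 1

# Logged transitions (s, a, r, s', a'): the behavior policy always takes action 0.
dataset = [(0, 0, 0.0, 1, 0), (1, 0, 0.0, 0, 0)]

for step in range(200):
    s, a, r, s_next, a_next = dataset[step % len(dataset)]
    target = r + gamma * Q[s_next, a_next]   # SARSA-style, on-policy w.r.t. the data
    Q[s, a] += lr * (target - Q[s, a])

# Q(s, 0) stays at the true behavior value of 0; the stale optimistic entries
# for the unseen action are simply never consulted by the target.
print(Q)
```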