2019
DOI: 10.48550/arxiv.1909.11821
Preprint

Model Imitation for Model-Based Reinforcement Learning

Abstract: Model-based reinforcement learning (MBRL) aims to learn a dynamics model to reduce the number of interactions with real-world environments. However, due to estimation error, rollouts in the learned model, especially those of long horizon, fail to match the ones in real-world environments. This mismatch seriously impacts the sample complexity of MBRL. The phenomenon can be attributed to the fact that previous works employ supervised learning to learn one-step transition models, which has inherent dif…
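To make the compounding-error problem concrete, here is a minimal, purely illustrative Python sketch (not from the paper; the toy dynamics and the 0.05 per-step error are assumptions): even when a one-step model fit by supervised regression has small per-step error, the gap between model rollouts and real rollouts grows with the horizon.

import numpy as np

def true_step(s, a):
    # Toy ground-truth dynamics, chosen only for illustration.
    return 0.9 * s + np.sin(a)

def learned_step(s, a, eps=0.05):
    # One-step model with a small, fixed estimation error.
    return 0.9 * s + np.sin(a) + eps

s_true, s_model = 1.0, 1.0
for t in range(50):
    a = 0.1
    s_true = true_step(s_true, a)
    s_model = learned_step(s_model, a)
    # The rollout gap |s_model - s_true| accumulates step by step,
    # even though the one-step error is only eps = 0.05.
print(abs(s_model - s_true))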

Cited by 2 publications (2 citation statements)
References 9 publications
“…Mishra et al. [14] proposed the temporal segments model to make the agent's predictions more accurate and stable. Wu et al. [29] used a WGAN to learn the transition model by matching the distribution of multi-step rollouts in the model to that of real trajectories, thereby reducing the model's estimation error. Besides, building on the Model-based Value Expansion (MVE) proposed by Feinberg et al. [6], Xiao et al. [30] proposed Adaptive Model-based Value Expansion (AdaMVE).…”
Section: Dyna-style Algorithm
confidence: 99%
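The WGAN-based distribution matching quoted above can be sketched as follows. This is a hedged illustration of the general idea rather than the authors' implementation; every network, dimension, and hyperparameter here is an assumption. The critic scores whole multi-step rollouts, and the transition model is trained as the generator so that its rollout distribution approaches the real one.

import torch
import torch.nn as nn

state_dim, action_dim, horizon = 4, 2, 5

# Generator: the transition model mapping (s, a) to the next state.
model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                      nn.Linear(64, state_dim))
# Critic: scores whole multi-step rollouts, not single transitions.
critic = nn.Sequential(nn.Linear(horizon * state_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_m = torch.optim.RMSprop(model.parameters(), lr=5e-5)

def rollout(s0, actions):
    # Unroll the learned model for `horizon` steps and flatten the trajectory.
    states, s = [], s0
    for a in actions:
        s = model(torch.cat([s, a], dim=-1))
        states.append(s)
    return torch.cat(states, dim=-1)

def wgan_step(real_traj, s0, actions):
    # Critic ascends the Wasserstein objective (Lipschitz constraint omitted for brevity).
    fake_traj = rollout(s0, actions).detach()
    loss_c = critic(fake_traj).mean() - critic(real_traj).mean()
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    # The transition model tries to make its rollouts indistinguishable from real ones.
    loss_m = -critic(rollout(s0, actions)).mean()
    opt_m.zero_grad(); loss_m.backward(); opt_m.step()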
“…As a result, the robot may make incorrect decisions and take sub-optimal actions when solving large-scale decision-making control problems, leading to degraded asymptotic performance [12]. At present, various methods have been proposed to alleviate compounding model error (Wu et al. [13]; Janner et al. [14]; Lai et al. [15]). For example, Janner et al. [14] proposed MBPO (Model-Based Policy Optimization), which uses truncated short rollouts branched from real states.…”
Section: Introduction
confidence: 99%
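A hedged sketch of the branched short-rollout idea mentioned in the statement above: model rollouts start from real states drawn from the environment replay buffer and are truncated after k steps, so model error cannot compound over long horizons. `learned_model`, `policy`, and the buffers are illustrative placeholders, not the authors' code.

import random

def branched_rollouts(env_buffer, model_buffer, learned_model, policy,
                      num_branches=100, k=5):
    for _ in range(num_branches):
        s = random.choice(env_buffer)        # branch from a real state
        for _ in range(k):                   # truncated short rollout
            a = policy(s)
            s_next, r = learned_model(s, a)  # one step in the learned model
            model_buffer.append((s, a, r, s_next))
            s = s_next
    return model_buffer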