2019
DOI: 10.48550/arxiv.1906.08253
Preprint

When to Trust Your Model: Model-Based Policy Optimization

Abstract: Designing effective model-based reinforcement learning algorithms is difficult because the ease of data generation must be weighed against the bias of model-generated data. In this paper, we study the role of model usage in policy optimization both theoretically and empirically. We first formulate and analyze a model-based reinforcement learning algorithm with a guarantee of monotonic improvement at each step. In practice, this analysis is overly pessimistic and suggests that real off-policy data is always pref…
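As a rough illustration of the short, branched model rollouts the abstract alludes to, the following Python sketch alternates between collecting real transitions, fitting a small bootstrap ensemble of dynamics models, and generating K-step model rollouts that branch from real states. The toy dynamics, linear model ensemble, and random placeholder policy are assumptions for illustration only; MBPO itself uses probabilistic neural-network ensembles and Soft Actor-Critic for the policy update.

import numpy as np

rng = np.random.default_rng(0)
S_DIM, A_DIM, K = 3, 1, 5          # state dim, action dim, model rollout horizon

def true_step(s, a):               # unknown toy dynamics, stands in for the real environment
    return 0.9 * s + 0.1 * np.concatenate([a, a, a]) + 0.01 * rng.normal(size=S_DIM)

def policy(s):                     # placeholder policy (SAC in the actual algorithm)
    return rng.uniform(-1.0, 1.0, size=A_DIM)

real_buffer, model_buffer = [], []

for iteration in range(10):
    # 1) collect a small batch of real experience with the current policy
    s = rng.normal(size=S_DIM)
    for _ in range(20):
        a = policy(s)
        s_next = true_step(s, a)
        real_buffer.append((s, a, s_next))
        s = s_next

    # 2) fit a small bootstrap ensemble of linear dynamics models on the real data
    X = np.array([np.concatenate([s_, a_]) for s_, a_, _ in real_buffer])
    Y = np.array([sn for _, _, sn in real_buffer])
    ensemble = []
    for _ in range(4):
        idx = rng.integers(0, len(X), size=len(X))
        W, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
        ensemble.append(W)

    # 3) branch short K-step rollouts from real states under the learned model
    for start in rng.integers(0, len(real_buffer), size=10):
        s = real_buffer[start][0]
        for _ in range(K):
            a = policy(s)
            W = ensemble[rng.integers(len(ensemble))]   # sample an ensemble member per step
            s_next = np.concatenate([s, a]) @ W
            model_buffer.append((s, a, s_next))
            s = s_next

    # 4) the policy-improvement step (SAC updates on model_buffer) would run here

print(f"real transitions: {len(real_buffer)}, model transitions: {len(model_buffer)}")

Keeping the rollout horizon K short limits compounding model error, which is the trade-off between cheap model-generated data and its bias that the paper analyzes.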

Cited by 48 publications (99 citation statements)
References 10 publications
“…Model-based RL approaches typically alternate between fitting a predictive model of the environment dynamics/rewards and updating the control policies. The model can be used in various ways, such as execution-time planning [5,21], generating imaginary experiences for training the control policy [12,32], etc. Our work is inspired by [7], which addresses the problem of error in long-horizon model dynamics prediction.…”
Section: Related Work
confidence: 99%
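The excerpt above distinguishes two uses of a learned model: execution-time planning and generating imaginary experience for policy training. Below is a minimal, hedged sketch of the first use, random-shooting model predictive control; the dynamics model and reward function are hypothetical placeholders, not the planners from [5,21].

import numpy as np

rng = np.random.default_rng(1)
S_DIM, A_DIM, HORIZON, N_CANDIDATES = 2, 1, 10, 256

def model_step(s, a):              # learned dynamics model (placeholder)
    return s + 0.1 * np.concatenate([a, -a])

def reward(s, a):                  # task reward (placeholder: drive the state to the origin)
    return -float(np.sum(s ** 2)) - 0.01 * float(np.sum(a ** 2))

def plan_action(s0):
    """Score random action sequences under the model and return the best first action."""
    best_return, best_first_action = -np.inf, None
    for _ in range(N_CANDIDATES):
        actions = rng.uniform(-1.0, 1.0, size=(HORIZON, A_DIM))
        s, total = s0, 0.0
        for a in actions:
            total += reward(s, a)
            s = model_step(s, a)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action       # MPC: execute only the first action, then replan

print(plan_action(np.array([1.0, -0.5])))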
“…Atari and Go), problems in robotics often have high-dimensional continuous states and actions, and are often limited by real-world sample budgets [14]. To this end, prior research in robotic learning has developed RL algorithms capable of performing continuous control [10,11,17,28] and sample-efficient learning methods, e.g., [1,7,12].…”
Section: Introduction
confidence: 99%
“…On the one hand stands the use of model predictive control in the engineering community, where finely specified dynamics models are constructed by engineers and only a small number of parameters are fit with system identification to determine mass, inertia, joint stiffness, etc. On the other stands the hands-off approach taken in the RL community, where general and unstructured neural networks are used for transition models [9,55,25] as well as for policies and value functions [20]. The state and action spaces for these systems are highly complex, with many diverse inputs such as quaternions, joint angles, forces, and torques that each transform in different ways under a symmetry transformation such as a left-right reflection or a rotation.…”
Section: Approximate Symmetries in Reinforcement Learning
confidence: 99%
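To make the point about heterogeneous state components concrete, here is a small illustrative Python sketch (not code from the cited work) of how different parts of a robot state might transform under a left-right reflection about the x-z plane; the component names, joint ordering, and sign conventions are hypothetical and depend on the particular robot.

import numpy as np

def reflect_left_right(state):
    """Apply a reflection about the x-z plane (y -> -y) to a dict of state components."""
    s = dict(state)
    # positions and velocities: only the lateral (y) component changes sign
    s["base_pos"] = state["base_pos"] * np.array([1.0, -1.0, 1.0])
    s["base_vel"] = state["base_vel"] * np.array([1.0, -1.0, 1.0])
    # orientation quaternion (w, x, y, z): a y-axis reflection maps the rotation axis to (-x, y, -z)
    s["base_quat"] = state["base_quat"] * np.array([1.0, -1.0, 1.0, -1.0])
    # joint angles: swap mirrored left/right joints and flip signs where the mirror
    # reverses the rotation direction (this layout is a hypothetical example)
    s["joint_angles"] = -state["joint_angles"][::-1]
    return s

state = {
    "base_pos": np.array([0.3, 0.1, 0.9]),
    "base_vel": np.array([1.0, -0.2, 0.0]),
    "base_quat": np.array([0.98, 0.1, 0.05, 0.15]),
    "joint_angles": np.array([0.2, -0.4, 0.4, -0.2]),
}
print(reflect_left_right(state))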
“…State-of-the-art model-based approaches on MuJoCo tend to use an ensemble of small MLPs that predict the state transitions [9,55,25,2], without exploiting any structure of the state space. We evaluate test rollout predictions via the relative error of the state over different horizon lengths for the RPP model against an MLP, the method of choice.…”
Section: Better Transition
confidence: 99%
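A hedged sketch of the evaluation described in this excerpt: open-loop rollout predictions scored by the relative error of the state over a given horizon. The model interface and array shapes are assumptions, not the RPP/MLP code used in the cited work.

import numpy as np

def rollout_relative_error(model_step, states, actions, horizon):
    """
    states:  (T+1, s_dim) ground-truth states
    actions: (T, a_dim)   actions actually taken
    Returns the mean ||s_hat_t - s_t|| / ||s_t|| over horizon-step open-loop rollouts.
    """
    errors = []
    for t0 in range(len(actions) - horizon):
        s_hat = states[t0]
        for k in range(horizon):
            s_hat = model_step(s_hat, actions[t0 + k])   # feed predictions back into the model
        s_true = states[t0 + horizon]
        errors.append(np.linalg.norm(s_hat - s_true) / (np.linalg.norm(s_true) + 1e-8))
    return float(np.mean(errors))

# toy check with a noiseless linear system and an exact model (error should be ~0)
A = np.array([[0.99, 0.01], [0.0, 0.98]])
states = [np.array([1.0, 1.0])]
actions = np.zeros((50, 1))
for _ in range(50):
    states.append(A @ states[-1])
states = np.array(states)
print(rollout_relative_error(lambda s, a: A @ s, states, actions, horizon=10))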
“…Consequently, so far, many RL milestones have been achieved through simulating conspicuous amounts of experience and tuning for effective task-specific parameters (Mnih et al, 2013; Silver et al, 2017). Recent off-policy model-free (Chen et al, 2021) and model-based algorithms (Janner et al, 2019) pushed forward the state-of-the-art sample efficiency on several benchmark simulation tasks (Brockman et al, 2016). We attribute such improvements to two main linked advances: more expressive models to capture uncertainty and better strategies to counteract detrimental biases from the learning process.…”
Section: Introduction
confidence: 99%