Learning Dynamics Model in Reinforcement Learning by Incorporating the Long Term Future

Ke, Nan Rosemary; Singh, Amanpreet; Touati, Ahmed; Goyal, Anirudh; Bengio, Yoshua; Parikh, Devi; Batra, Dhruv

doi:10.48550/arxiv.1903.01599

Cited by 11 publications

(12 citation statements)

References 0 publications

“…One challenge in these heuristics is that they may be unstable and difficult to fix or improve when they do not work in new environments. These heuristics can manifest in the form of learning a latent space that is locally linear, e.g., in Embed to Control and related methods (Watter et al, 2015), by enforcing that the model makes long-horizon predictions (Ke et al, 2019), ignoring uncontrollable parts of the state space (Ghosh et al, 2018), detecting and correcting when a predictive model steps off the manifold of reasonable states (Talvitie, 2017), adding reward signal prediction on top of the latent space (Gelada et al, 2019), or adding noise when training transitions (Mankowitz et al, 2019).…”

Section: Discussion Related Work and Future Work (mentioning)

confidence: 99%

Objective Mismatch in Model-based Reinforcement Learning

Lambert, Amos, Yadan et al. 2020

Preprint

Model-based reinforcement learning (MBRL) has been shown to be a powerful framework for data-efficiently learning control of continuous tasks. Recent work in MBRL has mostly focused on using more advanced function approximators and planning schemes, with little development of the general framework. In this paper, we identify a fundamental issue of the standard MBRL framework, which we call the objective mismatch issue. Objective mismatch arises when one objective is optimized in the hope that a second, often uncorrelated, metric will also be optimized. In the context of MBRL, we characterize the objective mismatch between training the forward dynamics model w.r.t. the likelihood of the one-step ahead prediction, and the overall goal of improving performance on a downstream control task. For example, this issue can emerge with the realization that dynamics models effective for a specific task do not necessarily need to be globally accurate, and, vice versa, globally accurate models might not be sufficiently accurate locally to obtain good control performance on a specific task. In our experiments, we study this objective mismatch issue and demonstrate that the likelihood of one-step ahead predictions is not always correlated with control performance. This observation highlights a critical limitation in the MBRL framework which will require further research to be fully understood and addressed. We propose an initial method to mitigate the mismatch issue by re-weighting dynamics model training. Building on it, we conclude with a discussion about other potential directions of research for addressing this issue.
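The re-weighting idea in the abstract above can be illustrated with a weighted one-step prediction loss. This is only a minimal sketch: the abstract does not say how the weights are chosen, so the function name `weighted_one_step_loss` and the example weights are hypothetical, and a squared error stands in for the likelihood objective.

```python
def weighted_one_step_loss(predictions, targets, weights):
    """Weighted mean squared one-step prediction error.

    Assumption: `weights` encode how much each transition matters for the
    downstream task. The weighting rule itself is not given in the abstract.
    """
    if not (len(predictions) == len(targets) == len(weights)):
        raise ValueError("predictions, targets and weights must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_errors = (
        w * (p - y) ** 2 for p, y, w in zip(predictions, targets, weights)
    )
    return sum(weighted_errors) / total_weight


# Two transitions with different errors; the weights decide which one dominates.
uniform = weighted_one_step_loss([1.0, 2.0], [0.0, 0.0], [1.0, 1.0])
first_only = weighted_one_step_loss([1.0, 2.0], [0.0, 0.0], [1.0, 0.0])
```

With uniform weights every transition counts equally, which corresponds to the standard objective; shifting weight toward task-relevant transitions changes which errors the model is trained to reduce.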

“…For each trajectory in the retrieval batch, we represent each time-step within a trajectory by a set of two vectors h_{i,t} and b_{i,t} (Figure 6 in the appendix), where h_{i,t} summarizes the past (i.e., from time-step 0 to time-step t of the i-th trajectory) while b_{i,t} summarizes the future (i.e., from time-step t onward) within the i-th trajectory. In addition, taking inspiration from (Jaderberg et al, 2016; Trinh et al, 2018; Ke et al, 2019; Devlin et al, 2018; Mazoure et al, 2020), we use auxiliary losses to improve modeling of long term dependencies when training the parameters of our forward and backward summarizers. The goal of these losses is to force the representation (h_{i,t}, b_{i,t})_{i,t≥0} to capture meaningful information for the unknown downstream task.…”

Section: Retrieval Batch Sampling and Pre-processing (mentioning)

confidence: 99%
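The forward and backward summaries in the citation statement above can be pictured with a toy example. This is a sketch under strong assumptions: the cited work trains its summarizers, whereas here plain running means over a scalar trajectory stand in for them, and the name `forward_backward_summaries` is hypothetical.

```python
def forward_backward_summaries(trajectory):
    """Return (forward, backward) running means over a scalar trajectory.

    forward[t]  averages steps 0..t   (a stand-in for the past summary h).
    backward[t] averages steps t..end (a stand-in for the future summary b).
    Assumption: running means replace the learned summarizers.
    """
    n = len(trajectory)

    forward = []
    running_total = 0.0
    for t, value in enumerate(trajectory):
        running_total += value
        forward.append(running_total / (t + 1))

    backward = [0.0] * n
    running_total = 0.0
    for t in range(n - 1, -1, -1):
        running_total += trajectory[t]
        backward[t] = running_total / (n - t)

    return forward, backward


past, future = forward_backward_summaries([1.0, 2.0, 3.0, 4.0])
```

Each time-step ends up with a pair of values, one depending only on what came before it and one only on what comes after, which is the structure the statement describes.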

Retrieval-Augmented Reinforcement Learning

Goyal, Friesen, Banino et al. 2022

Preprint

Self Cite

Most deep reinforcement learning (RL) algorithms distill experience into parametric behavior policies or value functions via gradient updates. While effective, this approach has several disadvantages: (1) it is computationally expensive, (2) it can take many updates to integrate experiences into the parametric model, (3) experiences that are not fully integrated do not appropriately influence the agent's behavior, and (4) behavior is limited by the capacity of the model. In this paper we explore an alternative paradigm in which we train a network to map a dataset of past experiences to optimal behavior. Specifically, we augment an RL agent with a retrieval process (parameterized as a neural network) that has direct access to a dataset of experiences. This dataset can come from the agent's past experiences, expert demonstrations, or any other relevant source. The retrieval process is trained to retrieve information from the dataset that may be useful in the current context, to help the agent achieve its goal faster and more efficiently. We integrate our method into two different RL agents: an offline DQN agent and an online R2D2 agent. In offline multi-task problems, we show that the retrieval-augmented DQN agent avoids task interference and learns faster than the baseline DQN agent. On Atari, we show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores. We run extensive ablations to measure the contributions of the components of our proposed method.

“…Other related methods have attempted to create direct, structural links between models and trajectory training data. With high dimensional images (rather than states) [23] uses an auto-regressive, recurrent network to predict observations in a latent state-space. [24] proposes a multi-step Gaussian Process for learning robotic control and the approach is studied further using the correlation between prediction steps in [25].…”

Section: Predicting Trajectories (mentioning)

confidence: 99%

Learning Accurate Long-term Dynamics for Model-based Reinforcement Learning

Wilcox, Zhang, Pister et al. 2020

Preprint

Accurately predicting the dynamics of robotic systems is crucial for model-based control and reinforcement learning. The most common way to estimate dynamics is by fitting a one-step ahead prediction model and using it to recursively propagate the predicted state distribution over long horizons. Unfortunately, this approach is known to compound even small prediction errors, making long-term predictions inaccurate. In this paper, we propose a new parametrization to supervised learning on state-action data to stably predict at longer horizons, which we call a trajectory-based model. This trajectory-based model takes an initial state, a future time index, and control parameters as inputs, and predicts the state at the future time. Our results in simulated and experimental robotic tasks show that our trajectory-based models yield significantly more accurate long term predictions, improved sample efficiency, and the ability to predict task reward.
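The compounding of one-step errors mentioned in the abstract above can be seen in a small numerical sketch. Everything here is an illustrative assumption: the scalar system, the 1% gain error, and the function names are hypothetical and are not taken from the paper.

```python
def true_step(state):
    """Hypothetical ground-truth dynamics: the state stays constant."""
    return state


def learned_step(state, gain_error=0.01):
    """Hypothetical one-step model with a small multiplicative error."""
    return (1.0 + gain_error) * state


def rollout(step_fn, initial_state, horizon):
    """Recursively apply a one-step function, feeding each output back in."""
    states = [initial_state]
    for _ in range(horizon):
        states.append(step_fn(states[-1]))
    return states


def rollout_errors(initial_state=1.0, horizon=50):
    """Absolute error of the recursive rollout at every time index."""
    truth = rollout(true_step, initial_state, horizon)
    predicted = rollout(learned_step, initial_state, horizon)
    return [abs(p - t) for p, t in zip(predicted, truth)]


errors = rollout_errors()
```

Because each prediction is fed back as the next input, the 1% per-step error grows geometrically with the horizon rather than linearly. A model that maps an initial state and a time index directly to the future state does not feed its own output back in, which is the motivation the abstract gives for the trajectory-based parametrization.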
