Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence 2022
DOI: 10.24963/ijcai.2022/516

Model-Based Offline Planning with Trajectory Pruning

Abstract: Graph neural networks (GNNs) are now popular for solving tasks in non-Euclidean spaces, and most of them learn deep embeddings by aggregating neighboring nodes. However, these methods are prone to problems such as over-smoothing, owing to their single-scale receptive field and their nature as low-pass filters. To address these limitations, we introduce the diffusion scattering network (DSN) to exploit high-order patterns. Observing the complementary relationship between multi-layer GNNs and the DSN, we propose …

Cited by 7 publications (6 citation statements)
References 3 publications
“…Value regularization methods [8, 14, 32] regularize the value function to assign low values to OOD actions. Uncertainty-based and model-based methods [1, 30, 36-38] estimate epistemic uncertainty from value functions or learned models to penalize OOD data. Finally, in-sample learning methods [3, 9, 31] learn the value function entirely within the data to avoid directly querying the Q-function on OOD actions produced by policies.…”
Section: Related Work
confidence: 99%
“…We, instead, aim to offer higher-quality synthetic data for offline training via an uncertainty-based trajectory truncation method. Most relevant to our work are M2AC [33], MOReL [17], and MOPP [45]. MOReL requires each generated single sample (rather than the whole trajectory) at each step to lie in a safe region.…”
Section: Related Work
confidence: 99%
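The trajectory-truncation idea described in this excerpt can be sketched as follows. This is a minimal illustration under assumed names (`truncate_trajectory`, `ensemble_predict`, `threshold` are not from the paper): a model rollout is cut at the first step whose ensemble disagreement exceeds a threshold, so only low-uncertainty prefixes are kept as synthetic data.

```python
import numpy as np

def truncate_trajectory(states, ensemble_predict, threshold=0.1):
    # Keep rollout states only up to (and excluding) the first state
    # whose ensemble disagreement exceeds the threshold.
    kept = []
    for s in states:
        preds = ensemble_predict(s)  # shape: (n_models, state_dim)
        if float(np.std(preds, axis=0).max()) > threshold:
            break
        kept.append(s)
    return kept
```

Truncating the whole remaining trajectory at the first uncertain step (rather than filtering individual samples) keeps the retained prefix temporally consistent.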
“…However, MBPO fails to resolve the issue of extrapolation error in the offline setting. Modern model-based offline RL methods typically rely on techniques such as uncertainty quantification [44, 17], value-function penalization to enforce conservatism [43], or planning [45] to learn meaningful policies from static logged data. Utilizing the learned dynamics model for offline data augmentation has also been explored recently [38, 27].…”
Section: Introduction
confidence: 99%
“…Incorporating the Dynamics Model. Model-based approaches have been widely adopted in RL to improve sample efficiency and have shown good performance and generalization ability in recent offline RL studies [20, 53, 56]. In our work, we introduce a probabilistic dynamics model implemented as a neural network that outputs a Gaussian distribution over the difference between the next and current state, i.e., f(s′|s, a) = N(s + μ_θf(s, a), Σ_θf(s, a)), where μ_θf(s, a) and Σ_θf(s, a) are the parameterized mean and diagonal covariance matrix.…”
Section: Discriminator-Guided Model-Based Imitation Learning (DMIL)
confidence: 99%
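A toy version of the probabilistic dynamics model described in this excerpt is sketched below. It is a stand-in, not the paper's network: μ and the diagonal Σ are fixed linear maps here purely for illustration, whereas a real implementation would parameterize them with a neural network.

```python
import numpy as np

class GaussianDynamics:
    # Predicts a Gaussian over the state *difference*, so the mean of
    # f(s'|s, a) is s + mu(s, a) and the covariance is diagonal.
    def __init__(self, state_dim, action_dim):
        self.W_s = np.zeros((state_dim, state_dim))       # hypothetical weights
        self.W_a = np.ones((state_dim, action_dim)) * 0.1
        self.log_std = np.full(state_dim, -2.0)           # log of diag std

    def predict(self, s, a):
        mu_delta = self.W_s @ s + self.W_a @ a
        cov = np.diag(np.exp(self.log_std) ** 2)          # diagonal Sigma
        return s + mu_delta, cov                          # mean next state, cov

    def sample(self, s, a, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        mean, cov = self.predict(s, a)
        return rng.multivariate_normal(mean, cov)
```

Predicting the state difference rather than the next state directly is a common parameterization: the model only has to learn the (often small) residual dynamics.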
“…The sample-efficiency requirement of offline IL methods reminds us of the success of model-based approaches in the online and offline RL domains [21, 20, 53, 56]. Dynamics models learned from the data can greatly supplement the limited expert data to improve state-action space coverage, leading to potentially improved policy performance and generalizability [20, 56, 15, 5]. However, adopting a model-based approach in offline IL remains an underexplored area [15, 5, 37].…”
Section: Introduction
confidence: 99%