2021
DOI: 10.48550/arxiv.2104.13877
Preprint

Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization

Michael R. Zhang,
Tom Le Paine,
Ofir Nachum
et al.

Abstract: Standard dynamics models for continuous control make use of feedforward computation to predict the conditional distribution of next state and reward given current state and action using a multivariate Gaussian with a diagonal covariance structure. This modeling choice assumes that different dimensions of the next state and reward are conditionally independent given the current state and action and may be driven by the fact that fully observable physics-based simulation environments entail deterministic transit…
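
The following is a minimal sketch, assuming PyTorch, of the standard feedforward dynamics model the abstract describes: a single forward pass maps (state, action) to a diagonal-covariance Gaussian over the concatenated next state and reward, so every output dimension is treated as conditionally independent. The class and variable names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class DiagonalGaussianDynamics(nn.Module):
    """Feedforward dynamics model with a diagonal-covariance Gaussian head."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        out_dim = state_dim + 1  # next state plus scalar reward
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, out_dim)
        self.log_std_head = nn.Linear(hidden, out_dim)

    def forward(self, state: torch.Tensor, action: torch.Tensor):
        h = self.net(torch.cat([state, action], dim=-1))
        mean = self.mean_head(h)
        log_std = self.log_std_head(h).clamp(-10.0, 2.0)
        # Diagonal covariance: Independent(Normal) factorizes the density
        # across output dimensions, i.e. the conditional-independence assumption.
        return torch.distributions.Independent(
            torch.distributions.Normal(mean, log_std.exp()), 1
        )

# Training is maximum likelihood on offline transitions (s, a, s', r):
# loss = -model(s, a).log_prob(torch.cat([s_next, r], dim=-1)).mean()
```

As the title suggests, the paper's autoregressive dynamics models instead predict each output dimension conditioned on the dimensions generated so far, dropping this independence assumption.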

Cited by 3 publications (4 citation statements)
References 17 publications
“…This form of evaluation is closer to a worst-case analysis than the typical average-performance reporting. In this case, the conservative estimate is important, since there exists no equivalent to early stopping from supervised learning in the offline RL setting: offline policy evaluation and selection are still open problems (Hans et al., 2011; Paine et al., 2020; Konyushkova et al., 2021; Zhang et al., 2021; Fu et al., 2021), so we need to stop policy training at some random point and deploy that policy on the real system. The above procedure is meant to simulate this random stopping and quantify how well the algorithm performs at a minimum.…”
Section: Discussion
confidence: 99%
“…A variety of contemporary OPE methods have been proposed, which can be divided into three main categories: (i) Inverse propensity scoring (Precup 2000), such as Importance Sampling (IS) (Doroudi, Thomas, and Brunskill 2017). (ii) Direct methods that directly estimate the value functions of the evaluation policy (Nachum et al. 2019; Zhang et al. 2021; Yang et al. 2022), including but not limited to model-based estimators (MB) (Paduraru 2013; Zhang et al. 2021), value-based estimators (Le, Voloshin, and Yue 2019) such as Fitted Q Evaluation (FQE), and minimax estimators (Zhang, Liu, and Whiteson 2020; Yue 2021) such as DualDICE (Yang et al. 2020a). (iii) Hybrid methods that combine aspects of both inverse propensity scoring and direct methods (Thomas and Brunskill 2016), such as DR (Jiang and Li 2016).…”
Section: Related Work
confidence: 99%
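
The inverse-propensity-scoring family mentioned above can be illustrated with a short sketch, assuming plain NumPy; the function name and trajectory format are illustrative. Per-trajectory importance sampling reweights each logged return by the product of policy ratios between the evaluation and behavior policies.

```python
import numpy as np

def importance_sampling_ope(trajectories, pi_e, pi_b, gamma=0.99):
    """Per-trajectory importance sampling (IS) estimate of a policy's value.

    trajectories: list of trajectories, each a list of (state, action, reward);
    pi_e(s, a), pi_b(s, a): action probabilities under the evaluation / behavior policy.
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e(s, a) / pi_b(s, a)   # cumulative importance ratio
            ret += (gamma ** t) * r             # discounted return of the logged trajectory
        estimates.append(weight * ret)          # trajectory-wise IS estimate
    return float(np.mean(estimates))
```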
“…Offline policy evaluation or offline hyperparameter selection is a field concerned with mitigating the problem that offline RL algorithms cannot evaluate their policies on the true environment and thus risk deploying policies that perform worse than before: essentially, these methods try to evaluate (or at least rank) the policies found by an offline RL algorithm, in order to pick the best-performing one for deployment and/or to tune hyperparameters. Often, dynamics models are used to evaluate policies found by model-free algorithms; however, model-free evaluation methods also exist (Hans et al., 2011; Paine et al., 2020; Konyushkova et al., 2021; Zhang et al., 2021; Fu et al., 2021). Unfortunately, but also intuitively, this problem is rather hard, since if any method is found that can accurately assess the policy performance (i.e.…”
Section: Related Work
confidence: 99%
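
The model-based evaluation idea in the statement above can be sketched as follows, under the assumption of illustrative stand-in interfaces (`model.step` and `policy.act` are hypothetical, not a specific library API): roll each candidate policy out in a learned dynamics model, average the simulated returns, and rank candidates by that estimate.

```python
import numpy as np

def rank_policies(policies, model, start_states, horizon=200, gamma=0.99):
    """Rank candidate policies by their average return under a learned dynamics model."""
    scores = []
    for policy in policies:
        returns = []
        for s in start_states:
            total, state = 0.0, s
            for t in range(horizon):
                action = policy.act(state)                 # candidate policy's action
                state, reward = model.step(state, action)  # learned dynamics model rollout
                total += (gamma ** t) * reward
            returns.append(total)
        scores.append(np.mean(returns))
    # Indices of policies sorted from highest to lowest estimated return.
    return sorted(range(len(policies)), key=lambda i: -scores[i])
```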
“…Instead, we could perform offline RL and derive a policy that potentially outperforms the users. However, in offline RL we cannot simply evaluate policy candidates on the real system, and offline policy evaluation is still an open problem (Hans et al., 2011; Paine et al., 2020; Konyushkova et al., 2021; Zhang et al., 2021; Fu et al., 2021). We thus run the risk of choosing the wrong proximity hyperparameter λ and either achieving lower rewards or not meeting the users' personal preferences.…”
Section: Lion: Learning In Interactive Offline Environments
confidence: 99%