2021
DOI: 10.48550/arxiv.2104.13877
Preprint

Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization

Michael R. Zhang,
Tom Le Paine,
Ofir Nachum
et al.

Abstract: Standard dynamics models for continuous control make use of feedforward computation to predict the conditional distribution of next state and reward given current state and action using a multivariate Gaussian with a diagonal covariance structure. This modeling choice assumes that different dimensions of the next state and reward are conditionally independent given the current state and action and may be driven by the fact that fully observable physics-based simulation environments entail deterministic transit…
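
The following is a minimal sketch, assuming PyTorch, of the standard feedforward dynamics model the abstract describes: a single forward pass maps (state, action) to a diagonal-covariance Gaussian over the concatenated next state and reward, so every output dimension is treated as conditionally independent. The class and variable names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class DiagonalGaussianDynamics(nn.Module):
    """Feedforward dynamics model with a diagonal-covariance Gaussian head."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        out_dim = state_dim + 1  # next state plus scalar reward
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, out_dim)
        self.log_std_head = nn.Linear(hidden, out_dim)

    def forward(self, state: torch.Tensor, action: torch.Tensor):
        h = self.net(torch.cat([state, action], dim=-1))
        mean = self.mean_head(h)
        log_std = self.log_std_head(h).clamp(-10.0, 2.0)
        # Diagonal covariance: Independent(Normal) factorizes the density
        # across output dimensions, i.e. the conditional-independence assumption.
        return torch.distributions.Independent(
            torch.distributions.Normal(mean, log_std.exp()), 1
        )

# Training is maximum likelihood on offline transitions (s, a, s', r):
# loss = -model(s, a).log_prob(torch.cat([s_next, r], dim=-1)).mean()
```

As the title suggests, the paper's autoregressive dynamics models instead predict each output dimension conditioned on the dimensions generated so far, dropping this independence assumption.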

Cited by 3 publications (4 citation statements)
References 17 publications
“…This form of evaluation is closer to a worst-case analysis than the typical average-performance reporting. In this case, the conservative estimate is important, since there exists no equivalent to early stopping from supervised learning in the offline RL setting: offline policy evaluation and selection are still open problems (Hans et al., 2011; Paine et al., 2020; Konyushkova et al., 2021; Zhang et al., 2021; Fu et al., 2021), so we need to stop policy training at some random point and deploy that policy on the real system. The above procedure is meant to simulate this random stopping and quantify how well the algorithm performs at a minimum.…”
Section: Discussion
confidence: 99%
“…A variety of contemporary OPE methods have been proposed, which can be divided into three main categories: (i) Inverse propensity scoring (Precup 2000), such as Importance Sampling (IS) (Doroudi, Thomas, and Brunskill 2017). (ii) Direct methods that directly estimate the value functions of the evaluation policy (Nachum et al. 2019; Zhang et al. 2021; Yang et al. 2022), including but not limited to model-based estimators (MB) (Paduraru 2013; Zhang et al. 2021), value-based estimators (Le, Voloshin, and Yue 2019) such as Fitted Q Evaluation (FQE), and minimax estimators (Zhang, Liu, and Whiteson 2020; Yue 2021) such as DualDICE (Yang et al. 2020a). (iii) Hybrid methods that combine aspects of both inverse propensity scoring and direct methods (Thomas and Brunskill 2016), such as DR (Jiang and Li 2016).…”
Section: Related Work
confidence: 99%
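
The inverse-propensity-scoring family mentioned above can be illustrated with a short sketch, assuming plain NumPy; the function name and trajectory format are illustrative. Per-trajectory importance sampling reweights each logged return by the product of policy ratios between the evaluation and behavior policies.

```python
import numpy as np

def importance_sampling_ope(trajectories, pi_e, pi_b, gamma=0.99):
    """Per-trajectory importance sampling (IS) estimate of a policy's value.

    trajectories: list of trajectories, each a list of (state, action, reward);
    pi_e(s, a), pi_b(s, a): action probabilities under the evaluation / behavior policy.
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e(s, a) / pi_b(s, a)   # cumulative importance ratio
            ret += (gamma ** t) * r             # discounted return of the logged trajectory
        estimates.append(weight * ret)          # trajectory-wise IS estimate
    return float(np.mean(estimates))
```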
“…Offline policy evaluation or offline hyperparameter selection is a field concerned with mitigating the problem that offline RL algorithms cannot evaluate their policies on the true environment and thus risk deploying policies that perform worse than before: essentially, these methods try to evaluate (or at least rank) the policies found by an offline RL algorithm, in order to pick the best-performing one for deployment and/or to tune hyperparameters. Often, dynamics models are used to evaluate policies found by model-free algorithms; however, model-free evaluation methods also exist (Hans et al., 2011; Paine et al., 2020; Konyushkova et al., 2021; Zhang et al., 2021; Fu et al., 2021). Unfortunately, but also intuitively, this problem is rather hard, since if any method is found that can accurately assess the policy performance (i.e.…”
Section: Related Work
confidence: 99%
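
The model-based evaluation idea in the statement above can be sketched as follows, under the assumption of illustrative stand-in interfaces (`model.step` and `policy.act` are hypothetical, not a specific library API): roll each candidate policy out in a learned dynamics model, average the simulated returns, and rank candidates by that estimate.

```python
import numpy as np

def rank_policies(policies, model, start_states, horizon=200, gamma=0.99):
    """Rank candidate policies by their average return under a learned dynamics model."""
    scores = []
    for policy in policies:
        returns = []
        for s in start_states:
            total, state = 0.0, s
            for t in range(horizon):
                action = policy.act(state)                 # candidate policy's action
                state, reward = model.step(state, action)  # learned dynamics model rollout
                total += (gamma ** t) * reward
            returns.append(total)
        scores.append(np.mean(returns))
    # Indices of policies sorted from highest to lowest estimated return.
    return sorted(range(len(policies)), key=lambda i: -scores[i])
```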
“…Instead, we could perform offline RL and derive a policy that potentially outperforms the users. However, in offline RL we cannot simply evaluate policy candidates on the real system, and offline policy evaluation is still an open problem (Hans et al., 2011; Paine et al., 2020; Konyushkova et al., 2021; Zhang et al., 2021; Fu et al., 2021). We thus run the risk of choosing the wrong proximity hyperparameter λ and either achieving lower rewards or not meeting the users' personal preferences.…”
Section: Lion: Learning In Interactive Offline Environments
confidence: 99%