Philip S. Thomas scite author profile

This paper introduces new optimality-preserving operators on Q-functions. We first describe an operator for tabular representations, the consistent Bellman operator, which incorporates a notion of local policy consistency. We show that this local consistency leads to an increase in the action gap at each state; increasing this gap, we argue, mitigates the undesirable effects of approximation and estimation errors on the induced greedy policies. This operator can also be applied to
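As a rough illustration of the local-consistency idea in this abstract, the Python sketch below compares the standard Bellman optimality operator with a consistent Bellman operator on a tabular Q-function. The abstract does not state the formula, so the form used here (replace max_b Q(x, b) with Q(x, a) whenever the next state equals the current state) is our reading of the operator's usual tabular statement. The function names, the array layout, and the tiny two-state MDP are hypothetical choices made for the example.

```python
import numpy as np

def bellman(Q, P, R, gamma):
    """Standard Bellman optimality operator on a tabular Q-function.

    Assumed layout (not from the source): Q and R have shape (S, A),
    P has shape (S, A, S) with P[x, a, y] = Pr(next state y | x, a).
    """
    V = Q.max(axis=1)              # greedy value of each state, shape (S,)
    return R + gamma * (P @ V)     # (S, A, S) @ (S,) -> (S, A)

def consistent_bellman(Q, P, R, gamma):
    """Sketch of a consistent Bellman operator (assumed tabular form).

    On a self-transition x -> x the backup uses Q(x, a) instead of
    max_b Q(x, b); all other transitions are backed up as usual.
    """
    V = Q.max(axis=1)
    states = np.arange(Q.shape[0])
    p_self = P[states, :, states]  # p_self[x, a] = P[x, a, x], shape (S, A)
    # Subtract the part of the standard backup that came from self-transitions
    # and replace it with Q(x, a).
    return R + gamma * (P @ V - p_self * (V[:, None] - Q))

def action_gap(Q):
    """Difference between the best and second-best action value per state."""
    ordered = np.sort(Q, axis=1)
    return ordered[:, -1] - ordered[:, -2]

# Toy MDP: 2 states, 2 actions, zero rewards (illustrative values only).
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # state 0: action 0 stays, action 1 moves to 1
    [[0.0, 1.0], [0.0, 1.0]],   # state 1: both actions stay in state 1
])
R = np.zeros((2, 2))
Q = np.array([[1.0, 2.0], [0.0, 3.0]])
gamma = 0.5

TQ = bellman(Q, P, R, gamma)
TcQ = consistent_bellman(Q, P, R, gamma)
print("standard backup:\n", TQ, "\ngaps:", action_gap(TQ))
print("consistent backup:\n", TcQ, "\ngaps:", action_gap(TcQ))
```

Because Q(x, a) never exceeds max_b Q(x, b), the consistent backup in this sketch is elementwise no larger than the standard one. It lowers only the values of actions that can self-transition, which is how the gap between the best and second-best action can widen.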


Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning

Thomas, Brunskill. 2016. Preprint.

Adaptive Rollout Length for Model-Based RL Using Model-Free Deep RL

Bhatia, Thomas, Zilberstein. 2022. Preprint.

Model-based reinforcement learning promises to learn an optimal policy from fewer interactions with the environment than model-free reinforcement learning by learning an intermediate model of the environment in order to predict future interactions. When predicting a sequence of interactions, the rollout length, which limits the prediction horizon, is a critical hyperparameter because the accuracy of the predictions diminishes in regions that are farther away from real experience. As a result, with a longer rollout length, an overall worse policy is learned in the long run. Thus, the hyperparameter provides a trade-off between quality and efficiency. In this work, we frame the problem of tuning the rollout length as a meta-level sequential decision-making problem that optimizes the final policy learned by model-based reinforcement learning given a fixed budget of environment interactions. The hyperparameter is adapted dynamically based on feedback from the learning process, such as the accuracy of the model and the remaining budget of interactions. We use model-free deep reinforcement learning to solve the meta-level decision problem and demonstrate that our approach outperforms common heuristic baselines on two well-known reinforcement learning environments. Preprint. Under review.


Reinforcement Learning When All Actions are Not Always Available

Chandak, Theocharous, Metevier et al. 2019. Preprint.


