In this work we address the problem of using time series data to choose among a finite set of candidate discrete state spaces constructed by a domain expert. We formalize the notion of model selection consistency in the proposed setup. We then discuss the difference between our proposed framework and the classical Maximum Likelihood (ML) framework, and give an example where ML fails. Afterwards, we suggest alternative selection criteria and show them to be weakly consistent. Finally, we test the performance of the suggested criteria on both simulated and real-world data.
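To make the setup concrete, the following minimal Python sketch scores each candidate discrete state space by the empirical Markov log-likelihood of the observed series. The candidate maps and observation sequence are illustrative assumptions, and the plain ML score shown here is the criterion the abstract says can fail, not the proposed alternative criteria.

```python
# Minimal sketch (not the paper's criterion): score each candidate discrete
# state space by the empirical Markov log-likelihood of an observed series.
from collections import Counter
import math

def markov_log_likelihood(obs, state_map):
    """Empirical Markov log-likelihood of the series under the candidate
    state space induced by state_map: observation -> discrete state."""
    states = [state_map(o) for o in obs]
    transition_counts = Counter(zip(states[:-1], states[1:]))
    outgoing_counts = Counter(states[:-1])
    return sum(c * math.log(c / outgoing_counts[s])
               for (s, _), c in transition_counts.items())

def ml_select(obs, candidates):
    """Plain maximum-likelihood selection over a dict name -> state_map.
    Note: with nested candidates, refining the states can only raise this
    score, which is the kind of failure mode alternative criteria must avoid."""
    return max(candidates, key=lambda name: markov_log_likelihood(obs, candidates[name]))
```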
We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm (Sutton, Mahmood, and White, 2015), which encompasses the original ETD(λ), as well as several other off-policy evaluation algorithms, as special cases. We call this framework ETD(λ, β), where our introduced parameter β controls the decay rate of an importance-sampling term. We study conditions under which the projected fixed-point equation underlying ETD(λ, β) involves a contraction operator, allowing us to present the first asymptotic error bounds (bias) for ETD(λ, β). Our results show that the original ETD algorithm always involves a contraction operator, and its bias is bounded. Moreover, by controlling β, our proposed generalization allows trading off bias for variance reduction, thereby achieving a lower total error.
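As one reading of how β enters, the sketch below shows a plausible per-transition ETD(λ, β)-style update with linear function approximation and unit interest: β replaces the discount in the follow-on (importance-sampling) trace, so setting β to the discount factor would recover the original ETD(λ). The variable names, the unit-interest assumption, and the bookkeeping of the previous-step ratio are illustrative, not a verified transcription of the algorithm.

```python
import numpy as np

def etd_lambda_beta_step(theta, e, F, phi, phi_next, reward,
                         rho, rho_prev, gamma, lam, beta, alpha):
    """One transition (phi -> phi_next) observed under the behavior policy.
    rho, rho_prev: importance-sampling ratios pi/mu for the current and
    preceding actions (illustrative bookkeeping)."""
    F = beta * rho_prev * F + 1.0          # follow-on trace with unit interest;
                                           # beta sets how fast the importance-
                                           # sampling product decays
    M = lam + (1.0 - lam) * F              # emphasis weighting
    e = rho * (gamma * lam * e + M * phi)  # emphatic eligibility trace
    delta = reward + gamma * theta @ phi_next - theta @ phi  # TD error
    theta = theta + alpha * delta * e
    return theta, e, F

# Example wiring (dimension d is illustrative):
# theta, e = np.zeros(d), np.zeros(d); F = 0.0
```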
The classical Policy Iteration (PI) algorithm alternates between greedy one-step policy improvement and policy evaluation. Recent literature shows that multi-step lookahead policy improvement leads to a better convergence rate at the expense of increased complexity per iteration. However, the best fixed lookahead horizon cannot be determined before running the algorithm. Moreover, within a given run, using a lookahead horizon larger than one is often wasteful. In this work, we propose for the first time to dynamically adapt the multi-step lookahead horizon as a function of the state and of the value estimate. We devise two PI variants and analyze the trade-off between iteration count and computational complexity per iteration. The first variant takes the desired contraction factor as the objective and minimizes the per-iteration complexity. The second variant takes the computational complexity per iteration as input and minimizes the overall contraction factor. We then devise a corresponding DQN-based algorithm with an adaptive tree-search horizon. We also include a novel enhancement for on-policy learning: a per-depth value function estimator. Lastly, we demonstrate the efficacy of our adaptive lookahead method in a maze environment and in Atari.
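As a toy illustration of one ingredient of the first variant, the sketch below computes the smallest lookahead horizon whose h-step contraction γ^h meets a desired contraction target. The paper's actual rule adapts the horizon per state using the value estimate, which is not reproduced here.

```python
# Toy sketch: the smallest lookahead h such that the h-step contraction
# gamma**h meets a desired per-iteration contraction target.
import math

def min_horizon_for_contraction(gamma: float, target: float) -> int:
    """Smallest h with gamma**h <= target, for gamma and target in (0, 1)."""
    return max(1, math.ceil(math.log(target) / math.log(gamma)))

# Example: with gamma = 0.99, reaching a per-iteration contraction of 0.5
# requires min_horizon_for_contraction(0.99, 0.5) == 69 lookahead steps.
```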
The problem of online off-policy evaluation (OPE) has been actively studied in the last decade due to its importance both as a stand-alone problem and as a module in a policy improvement scheme. However, most Temporal Difference (TD) based solutions ignore the discrepancy between the stationary distributions of the behavior and target policies and its effect on the convergence limit when function approximation is applied. In this paper we propose the Consistent Off-Policy Temporal Difference (COP-TD(λ, β)) algorithm, which addresses this issue and reduces the resulting bias at some computational expense. We show that COP-TD(λ, β) can be designed to converge to the same value that would have been obtained by using on-policy TD(λ) with the target policy. Subsequently, the proposed scheme leads to a related and promising heuristic we call log-COP-TD(λ, β). Both algorithms compare favorably empirically with current state-of-the-art online OPE algorithms. Finally, our formulation sheds new light on the recently proposed Emphatic TD learning.
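The tabular sketch below illustrates the stationary-distribution correction idea: a ratio c(s) ≈ d_target(s)/d_behavior(s) is learned online and used to weight the off-policy TD update. The update for c and its placement in the value update are illustrative assumptions, not the exact COP-TD(λ, β) or log-COP-TD(λ, β) recursions.

```python
# Tabular sketch of the stationary-distribution correction idea: learn a
# ratio c(s) ~ d_target(s) / d_behavior(s) online and use it to weight the
# off-policy TD update. Illustrative only; not the paper's exact recursion.
import numpy as np

def cop_td_style_step(V, c, s, s_next, reward, rho, gamma, alpha_v, alpha_c):
    """One transition s -> s_next observed under the behavior policy; rho = pi/mu."""
    # Propagate the distribution-correction ratio along the transition
    # (illustrative stochastic update).
    c[s_next] += alpha_c * (rho * c[s] - c[s_next])
    # Weight the usual off-policy TD(0) update by the learned ratio, so the
    # update is (approximately) taken under the target's state distribution.
    delta = reward + gamma * V[s_next] - V[s]
    V[s] += alpha_v * c[s] * rho * delta
    return V, c

# Example wiring: V = np.zeros(n_states); c = np.ones(n_states)
```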