Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms in contemporary reinforcement learning. This class of methods is often applied in conjunction with entropy regularization, an algorithmic scheme that encourages exploration, and is closely related to soft policy iteration and trust region policy optimization. Despite the empirical success, the theoretical underpinnings of NPG methods remain limited even for the tabular setting. This paper develops non-asymptotic convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on discounted Markov decision processes (MDPs). Assuming access to exact policy evaluation, we demonstrate that the algorithm converges linearly (or even quadratically once it enters a local region around the optimal policy) when computing optimal value functions of the regularized MDP. Moreover, the algorithm is provably stable vis-à-vis inexactness of policy evaluation. Our convergence results accommodate a wide range of learning rates and shed light on the role of entropy regularization in enabling fast convergence.
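As a concrete illustration, the sketch below shows one entropy-regularized NPG step for a tabular softmax policy, written in the equivalent soft-policy-iteration form in which the new log-policy blends the old log-policy with the regularized Q-function. This is a minimal sketch under our own notation (Q_tau, tau, eta, gamma) and assumes exact evaluation of the regularized Q-function; it is not a verbatim statement of the paper's algorithm.

```python
import numpy as np

def entropy_regularized_npg_step(log_pi, Q_tau, eta, tau, gamma):
    """One entropy-regularized NPG update for a tabular softmax policy.

    log_pi : (S, A) array of log-probabilities of the current policy.
    Q_tau  : (S, A) array of exact regularized Q-values of the current policy.
    eta    : learning rate; tau : regularization weight; gamma : discount factor.
    """
    alpha = eta * tau / (1.0 - gamma)                 # weight taken off the old log-policy
    logits = (1.0 - alpha) * log_pi + (eta / (1.0 - gamma)) * Q_tau
    # renormalize each state's action distribution (log-sum-exp for numerical stability)
    logits -= logits.max(axis=1, keepdims=True)
    log_pi_new = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return log_pi_new
```

With tau = 0 this reduces to an unregularized NPG/soft-policy-iteration step, which is one way to see how the regularization term reshapes the update.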
The Lasso is a method for high-dimensional regression, which is now commonly used when the number of covariates p is of the same order as or larger than the number of observations n. Classical asymptotic normality theory does not apply to this model for two fundamental reasons: (1) the regularized risk is non-smooth; (2) the distance between the estimator θ̂ and the true parameter vector θ* cannot be neglected. As a consequence, standard perturbative arguments that are the traditional basis for asymptotic normality fail. On the other hand, the Lasso estimator can be precisely characterized in the regime in which both n and p are large and n/p is of order one. This characterization was first obtained in the case of standard Gaussian designs, and subsequently generalized to other high-dimensional estimation procedures. Here we extend the same characterization to Gaussian correlated designs with non-singular covariance structure. This characterization is expressed in terms of a simpler "fixed-design" model. We establish non-asymptotic bounds on the distance between the distributions of various quantities in the two models, which hold uniformly over signals θ* in a suitable sparsity class and over values of the regularization parameter. As an application, we study the distribution of the debiased Lasso and show that a degrees-of-freedom correction is necessary for computing valid confidence intervals.
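For intuition, here is a minimal sketch of a debiased Lasso with a degrees-of-freedom correction, assuming the design covariance Σ is known. The specific adjustment shown, rescaling the correction term by n minus the size of the Lasso's active set rather than by n, is one common form of such a correction and is an illustrative choice on our part, not a verbatim statement of the paper's estimator.

```python
import numpy as np
from sklearn.linear_model import Lasso

def debiased_lasso_df(X, y, Sigma, lam):
    """Debiased Lasso with a degrees-of-freedom correction.

    X : (n, p) design matrix; y : (n,) responses;
    Sigma : (p, p) design covariance (assumed known here);
    lam : Lasso regularization level (sklearn's `alpha`).
    """
    n, p = X.shape
    theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    residual = y - X @ theta_hat
    df = np.count_nonzero(theta_hat)   # active-set size used as degrees of freedom
    # Plain debiasing would divide by n; the correction divides by n - df instead.
    correction = np.linalg.solve(Sigma, X.T @ residual) / (n - df)
    return theta_hat + correction
```

The resulting coordinates can then be compared against a Gaussian reference distribution to form confidence intervals, which is the application discussed in the abstract.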
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP), based on a single trajectory of Markovian samples induced by a behavior policy. Focusing on a γ-discounted MDP with state space S and action space A, we demonstrate that the ℓ∞-based sample complexity of classical asynchronous Q-learning, namely, the number of samples needed to yield an entrywise ε-accurate estimate of the Q-function, is at most on the order of 1/(μ_min (1−γ)^5 ε^2) + t_mix/(μ_min (1−γ)) up to some logarithmic factor, provided that a proper constant learning rate is adopted. Here, t_mix and μ_min denote respectively the mixing time and the minimum state-action occupancy probability of the sample trajectory. The first term of this bound matches the sample complexity in the synchronous case with independent samples drawn from the stationary distribution of the trajectory. The second term reflects the cost for the empirical distribution of the Markovian trajectory to reach a steady state, which is incurred at the very beginning and becomes amortized as the algorithm runs. Encouragingly, the above bound improves upon the state-of-the-art result by a factor of at least |S||A| for all scenarios, and by a factor of at least t_mix|S||A| for any sufficiently small accuracy level ε. Further, we demonstrate that the scaling on the effective horizon 1/(1−γ) can be improved by means of variance reduction.
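To make the setup concrete, below is a minimal sketch of classical asynchronous Q-learning along a single behavior-policy trajectory with a constant learning rate: at each step only the visited (state, action) entry is updated. The environment interface (`env.reset`, `env.step`) and `behavior_policy` are hypothetical stand-ins of our own, not part of the paper.

```python
import numpy as np

def async_q_learning(env, behavior_policy, num_steps, eta, gamma, num_states, num_actions):
    """Classical asynchronous Q-learning on one Markovian trajectory.

    eta : constant learning rate; gamma : discount factor.
    `env` and `behavior_policy` are placeholder objects for the MDP sampler.
    """
    Q = np.zeros((num_states, num_actions))
    state = env.reset()
    for _ in range(num_steps):
        action = behavior_policy(state)
        next_state, reward = env.step(state, action)
        # standard temporal-difference target; only Q[state, action] is touched
        target = reward + gamma * Q[next_state].max()
        Q[state, action] += eta * (target - Q[state, action])
        state = next_state
    return Q
```

The contrast with the synchronous setting is that the samples here are generated by one Markov chain, so how often each (state, action) pair is visited is governed by μ_min and t_mix rather than by an independent draw at every entry.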