Harshat Kumar scite author profile

Harshat Kumar

5 Publications

84 Citation Statements Received

55 Citation Statements Given

Affiliations

University of Pennsylvania, Rutgers, The State University of New Jersey, Apple (United States)

Publications

On the Sample Complexity of Actor-Critic Method for Reinforcement Learning with Function Approximation

Kumar, Koppel, Ribeiro

2019

Preprint

Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps that estimate the value function and policy gradient updates. Because the updates exhibit correlated noise and biased gradient estimates, only the asymptotic behavior of actor-critic is known, obtained by connecting its behavior to dynamical systems. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in a controllable bias that depends on the number of critic evaluations. As a result, we are able to provide, for the first time, the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to that of stochastic gradient methods for non-convex problems, or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the case where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference learning. These convergence rates are then corroborated on a navigation problem involving an obstacle, which suggests that learning more slowly may lead to improved limit points, providing insight into the interplay between optimization and generalization in reinforcement learning.
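The abstract describes an actor-critic loop in which the critic runs for a number of evaluations before each policy-gradient step, and a Monte Carlo rollout supplies the sample for that step. The sketch below is a minimal illustration under stated assumptions, not the authors' algorithm: the two-state MDP (`env_step`), the one-hot (state, action) features, the softmax policy, the step sizes `alpha` and `beta`, and the advantage baseline are all hypothetical choices made for this example. The rollout length is drawn from a Geometric(1 - γ) distribution, so its final state-action pair is a sample from the normalized discounted occupancy measure, which is a standard way to obtain a policy-gradient sample.

```python
import math
import random

GAMMA = 0.9                 # discount factor (illustrative choice)
N_STATES, N_ACTIONS = 2, 2  # hypothetical toy MDP


def env_step(rng, s, a):
    """Hypothetical toy MDP: action 1 earns reward 1, action 0 earns 0,
    and the next state is uniform regardless of the action taken."""
    reward = 1.0 if a == 1 else 0.0
    return reward, rng.randrange(N_STATES)


def features(s, a):
    """One-hot (state, action) features, so Q_hat(s, a) = w . phi(s, a)."""
    phi = [0.0] * (N_STATES * N_ACTIONS)
    phi[s * N_ACTIONS + a] = 1.0
    return phi


def q_hat(w, s, a):
    """Linear critic: inner product of weights and features."""
    return sum(wi * fi for wi, fi in zip(w, features(s, a)))


def policy(theta, s):
    """Softmax policy over actions with per-state logits theta[s]."""
    m = max(theta[s])
    exps = [math.exp(x - m) for x in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]


def sample_action(rng, theta, s):
    return rng.choices(range(N_ACTIONS), weights=policy(theta, s))[0]


def critic_td0(rng, theta, w, n_steps, alpha):
    """SARSA-style TD(0) evaluation of the current policy, warm-started at w."""
    s = rng.randrange(N_STATES)
    a = sample_action(rng, theta, s)
    for _ in range(n_steps):
        r, s_next = env_step(rng, s, a)
        a_next = sample_action(rng, theta, s_next)
        delta = r + GAMMA * q_hat(w, s_next, a_next) - q_hat(w, s, a)
        w = [wi + alpha * delta * fi for wi, fi in zip(w, features(s, a))]
        s, a = s_next, a_next
    return w


def rollout_state_action(rng, theta):
    """Monte Carlo rollout of Geometric(1 - GAMMA) length: the final (s, a)
    is a sample from the normalized discounted occupancy measure."""
    s = rng.randrange(N_STATES)
    a = sample_action(rng, theta, s)
    while rng.random() < GAMMA:  # continue with probability GAMMA
        _, s = env_step(rng, s, a)
        a = sample_action(rng, theta, s)
    return s, a


def actor_critic(n_iters=300, critic_steps=50, alpha=0.1, beta=0.5, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    w = [0.0] * (N_STATES * N_ACTIONS)
    for _ in range(n_iters):
        # Critic: a fixed budget of TD(0) steps per actor update.
        w = critic_td0(rng, theta, w, critic_steps, alpha)
        # Actor: one stochastic policy-gradient step at a rollout sample.
        s, a = rollout_state_action(rng, theta)
        probs = policy(theta, s)
        baseline = sum(p * q_hat(w, s, b) for b, p in enumerate(probs))
        advantage = q_hat(w, s, a) - baseline
        for b in range(N_ACTIONS):
            # d/d theta[s][b] of log softmax = 1{b == a} - pi(b | s)
            score = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += beta * advantage * score
    return theta, w


if __name__ == "__main__":
    theta, w = actor_critic()
    # Probability of the rewarding action in each state after training.
    print([round(policy(theta, s)[1], 3) for s in range(N_STATES)])
```

Raising `critic_steps` lowers the error of `q_hat` at each actor step, which loosely mirrors the abstract's point that the bias depends on the number of critic evaluations; the baseline only reduces variance and is not implied by the abstract.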

A joint design approach for spectrum sharing between radar and communication systems

Kumar, Petropulu

2016

Zeroth-order Deterministic Policy Gradient

Kumar, Kalogerias, Pappas, et al.

2020

Preprint

On the sample complexity of actor-critic method for reinforcement learning with function approximation

2023

Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps that estimate the value function and policy gradient updates. Because the updates exhibit correlated noise and biased gradient estimates, only the asymptotic behavior of actor-critic is known, obtained by connecting its behavior to dynamical systems. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in a controllable bias that depends on the number of critic evaluations. As a result, we are able to provide, for the first time, the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to that of stochastic gradient methods for non-convex problems, or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the case where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference learning. These convergence rates are then corroborated on a navigation problem involving an obstacle and on the pendulum problem, which provide insight into the interplay between optimization and generalization in reinforcement learning.
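The abstract lists Gradient Temporal Difference among the critic estimators it specializes to. As a hedged illustration of that family, the sketch below implements the GTD2 update rules (Sutton et al., 2009) with linear features; the abstract does not say which gradient-TD variant or step sizes the paper analyzes, so the two-state reward process, one-hot features, `alpha`, `beta`, and the Polyak-Ruppert averaging are all assumptions made for this example. GTD2 keeps an auxiliary weight vector `w` on a faster timescale and moves the value weights `theta` along `(phi - gamma * phi') * (phi . w)`.

```python
import random

# Hypothetical 2-state Markov reward process (a fixed policy is folded in):
# the next state is uniform, state 0 pays reward 1, state 1 pays reward 0.
# Solving V = r + GAMMA * P V gives the true values V = [1.5, 0.5].
GAMMA = 0.5
REWARD = [1.0, 0.0]


def phi(s):
    """One-hot state features (tabular special case of linear approximation)."""
    return [1.0 if i == s else 0.0 for i in range(2)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gtd2(n_steps=100_000, alpha=0.02, beta=0.1, seed=0):
    """GTD2: theta holds the value weights; w estimates
    E[phi phi^T]^{-1} E[delta phi] on a faster timescale (beta > alpha)."""
    rng = random.Random(seed)
    theta, w = [0.0, 0.0], [0.0, 0.0]
    avg = [0.0, 0.0]  # Polyak-Ruppert average over the second half of the run
    s = rng.randrange(2)
    for t in range(n_steps):
        s_next = rng.randrange(2)
        f, f_next = phi(s), phi(s_next)
        delta = REWARD[s] + GAMMA * dot(theta, f_next) - dot(theta, f)
        fw = dot(f, w)  # both updates use the pre-update w
        # Auxiliary update: w <- w + beta * (delta - phi^T w) * phi
        w = [wi + beta * (delta - fw) * fi for wi, fi in zip(w, f)]
        # Main update: theta <- theta + alpha * (phi - gamma * phi') * (phi^T w)
        theta = [ti + alpha * (fi - GAMMA * gi) * fw
                 for ti, fi, gi in zip(theta, f, f_next)]
        if t >= n_steps // 2:
            k = t - n_steps // 2 + 1
            avg = [a + (ti - a) / k for a, ti in zip(avg, theta)]
        s = s_next
    return theta, avg


if __name__ == "__main__":
    theta, avg = gtd2()
    print("last iterate:", [round(x, 3) for x in theta])
    print("averaged:    ", [round(x, 3) for x in avg])
```

With one-hot features the TD fixed point coincides with the true values, which makes the result easy to check; with coarser features GTD2 would instead converge toward the projected TD fixed point.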

Actor-only Deterministic Policy Gradient via Zeroth-order Gradient Oracles in Action Space

Kumar, Kalogerias, Pappas, et al.

2021
