2019
DOI: 10.48550/arxiv.1910.08412
Preprint
On the Sample Complexity of Actor-Critic Method for Reinforcement Learning with Function Approximation

Abstract: Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Due to the fact that the updates exhibit correlated noise and biased gradient updates, only the asymptotic behavior of actor-critic is known by connecting its behavior to dynamical systems. This work puts forth a ne…
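
The alternation described in the abstract, a critic phase that estimates the value function followed by actor policy-gradient steps, can be illustrated with a minimal self-contained sketch in Python. The chain MDP, tabular softmax policy, TD(0) critic, and step sizes below are illustrative assumptions, not the paper's actual algorithm.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.95
theta = np.zeros((n_states, n_actions))   # actor parameters (softmax policy)
w = np.zeros(n_states)                    # critic parameters (tabular values)

def step(s, a):
    # Chain MDP (assumed for illustration): action 0 moves left, action 1
    # moves right; reward 1 whenever the rightmost state is reached.
    s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

def pi(s):
    # Softmax policy over the two actions in state s.
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

for _ in range(200):                      # outer (actor) iterations
    # Critic phase: TD(0) evaluation of the current policy.
    s = 0
    for _ in range(50):
        a = rng.choice(n_actions, p=pi(s))
        s_next, r = step(s, a)
        w[s] += 0.1 * (r + gamma * w[s_next] - w[s])
        s = s_next

    # Actor phase: a few policy-gradient steps, using the TD error as advantage.
    s = 0
    for _ in range(10):
        a = rng.choice(n_actions, p=pi(s))
        s_next, r = step(s, a)
        adv = r + gamma * w[s_next] - w[s]
        grad_log = -pi(s)                 # gradient of log softmax policy at (s, a)
        grad_log[a] += 1.0
        theta[s] += 0.05 * adv * grad_log
        s = s_next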

Cited by 29 publications (54 citation statements)
References 21 publications

“…Based on recent progress in non-convex optimization, non-asymptotic analyses of policy-based methods were first established for convergence to a stationary point. For example, [125] provided a convergence-rate analysis for a nested-loop actor-critic algorithm by quantifying the smallest number of actor updates k required to attain inf_{0≤m≤k} ‖∇J(θ^(m))‖^2 < ε. We denote this smallest number by K. When the actor uses a policy gradient step, the method achieves K ≤ O(1/ε^4) when the critic employs TD(0), K ≤ O(1/ε^3) when it employs gradient temporal difference (GTD), and K ≤ O(1/ε^{5/2}) when it employs accelerated GTD, with continuous state and action spaces.…”
Section: Discussion
confidence: 99%
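
In display form, the criterion and bounds quoted above read as follows; this is a reconstruction of the quoted inline notation, with K, θ^(m), and ε as in the statement.

\[
  K \;=\; \min\Bigl\{\, k \;:\; \inf_{0 \le m \le k} \bigl\|\nabla J(\theta^{(m)})\bigr\|^{2} < \varepsilon \,\Bigr\},
\]
\[
  K \le \mathcal{O}(\varepsilon^{-4}) \ \text{with a TD(0) critic}, \qquad
  K \le \mathcal{O}(\varepsilon^{-3}) \ \text{with a GTD critic}, \qquad
  K \le \mathcal{O}(\varepsilon^{-5/2}) \ \text{with an accelerated GTD critic}.
\]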
“…In the nested-loop setting (see, e.g., [125, 228]), the actor updates the policy in the outer loop after the critic's repeated updates in the inner loop. The second way is the two time-scale setting (see, e.g.…”
Section: Actor-Critic Methods
confidence: 99%
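
The structural contrast drawn in this excerpt can be sketched as follows in Python; critic_update and actor_update are hypothetical callables standing in for a TD-style evaluation step and a policy-gradient step, and the step-size schedules are illustrative assumptions only.

def nested_loop_actor_critic(theta, w, critic_update, actor_update,
                             n_outer=100, n_inner=50):
    # Nested-loop setting: the critic runs many inner updates (policy
    # evaluation) before each single actor update in the outer loop.
    for _ in range(n_outer):
        for _ in range(n_inner):
            w = critic_update(w, theta)
        theta = actor_update(theta, w)
    return theta, w

def two_timescale_actor_critic(theta, w, critic_update, actor_update,
                               n_iters=5000, beta0=0.1, alpha0=0.01):
    # Two time-scale setting: actor and critic are updated at every iteration,
    # with the critic's step size decaying more slowly than the actor's, so the
    # critic tracks the value function of the slowly changing policy.
    for t in range(1, n_iters + 1):
        w = critic_update(w, theta, step_size=beta0 / t ** 0.6)
        theta = actor_update(theta, w, step_size=alpha0 / t)
    return theta, w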
“…Second, in the RL literature actor-critic algorithms also aim to solve a problem similar to (4), where θ and ω are referred to as the actor and critic parameters, respectively; see, for example, [17, 20, 26, 29-31]. Among these works, only [17, 29] consider an online setting similar to the one studied in this paper.…”
Section: Related Work
confidence: 99%
“…The finite-time error bounds for the gradient TD algorithms [Maei et al., 2010] were further developed recently in [Dalal et al., 2018b, Liu et al., 2015, Gupta et al., 2019, Xu et al., 2019, Dalal et al., 2020, Kaledin et al., 2020, Wang and Zou, 2020, Ma et al., 2021]. There are also finite-time error bounds on policy gradient methods and actor-critic methods, e.g., [Kumar et al., 2019, Qiu et al., 2019, Wu et al., 2020, Cen et al., 2020, Bhandari and Russo, 2019, Agarwal et al., 2019, Mei et al., 2020]. We note that these studies are for non-robust RL algorithms; in this paper, we design robust RL algorithms and characterize their finite-time error bounds.…”
Section: Related Work
confidence: 99%