2020
DOI: 10.1609/aaai.v34i04.5779

A Tale of Two-Timescale Reinforcement Learning with the Tightest Finite-Time Bound

Abstract: Policy evaluation in reinforcement learning is often conducted using two-timescale stochastic approximation, which results in various gradient temporal difference methods such as GTD(0), GTD2, and TDC. Here, we provide convergence rate bounds for this suite of algorithms. Algorithms such as these have two iterates, θ_n and w_n, which are updated using two distinct stepsize sequences, α_n and β_n, respectively. Assuming α_n = n^(−α) and β_n = n^(−β) with 1 > α > β > 0, we show that, with high probability, the two …
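
To make the two-stepsize structure in the abstract concrete, here is a minimal sketch of a generic two-timescale stochastic approximation loop. It is a toy scalar linear recursion, not GTD(0), GTD2, or TDC themselves, and it does not reproduce the paper's analysis. The function name `two_timescale_sa`, the drift coefficients, the noise level, and the starting point are all hypothetical choices made for illustration; only the stepsize schedule, n to the power −α for the θ iterate and n to the power −β for the w iterate with 1 > α > β > 0, is taken from the abstract.

```python
import random


def two_timescale_sa(num_steps, alpha=0.75, beta=0.5, noise_std=0.5, seed=0):
    """Run a toy scalar linear two-timescale stochastic approximation.

    theta is the slow iterate (stepsize n**-alpha) and w is the fast
    iterate (stepsize n**-beta). All coefficients below are hypothetical.
    """
    if not (1 > alpha > beta > 0):
        raise ValueError("stepsize exponents must satisfy 1 > alpha > beta > 0")

    rng = random.Random(seed)

    # Hypothetical drift coefficients for the illustrative linear system:
    #   h1(theta, w) = v1 - g1 * theta - c1 * w   (drives theta)
    #   h2(theta, w) = v2 - g2 * theta - c2 * w   (drives w)
    # With these values the unique joint root is theta* = 0, w* = 2.
    v1, g1, c1 = 1.0, 1.0, 0.5
    v2, g2, c2 = 2.0, 1.0, 1.0

    theta, w = 1.0, 0.0  # arbitrary starting point
    for n in range(1, num_steps + 1):
        alpha_n = n ** (-alpha)  # slower-decaying exponent is beta, so
        beta_n = n ** (-beta)    # beta_n >= alpha_n: w moves faster

        # Noisy observations of the two drift terms (zero-mean Gaussian noise).
        h1 = v1 - g1 * theta - c1 * w + rng.gauss(0.0, noise_std)
        h2 = v2 - g2 * theta - c2 * w + rng.gauss(0.0, noise_std)

        # Both iterates are updated from the same (theta_n, w_n) pair.
        theta, w = theta + alpha_n * h1, w + beta_n * h2

    return theta, w


if __name__ == "__main__":
    theta, w = two_timescale_sa(20000)
    print(f"theta = {theta:.3f} (root 0), w = {w:.3f} (root 2)")
```

Because β < α, the stepsize for w decays more slowly, so w tracks its equilibrium for the current θ while θ drifts on the slower timescale. Running the script with different exponent pairs gives a rough feel for how the choice of α and β affects how close each iterate gets to its root after a fixed number of steps.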

Cited by 31 publications (32 citation statements)
References 9 publications
“…Specifically, it would be interesting to extend weak convergence rate analysis of (Konda and Tsitsiklis 2004; Mokkadem and Pelletier 2006) to Three-TS regime. Also, extending the recent convergence rate results in expectation and high probability of GTD methods (Dalal et al 2018b; Gupta, Srikant, and Ying 2019; Kaledin et al 2019; Dalal, Szorenyi, and Thoppe 2020) to these momentum settings would be interesting works for the future.…”
Section: Related Work and Conclusion (mentioning)
confidence: 85%
“…Another intriguing future direction is the setting with dynamic communication protocols, wherein the gossip matrix also evolves with time Doan et al [2019, 2021]. A third direction is that of two-timescale DSA schemes Dalal et al [2018b, 2020]. On the MARL side, important algorithms like distributed Q-learning Lauer and Riedmiller [2000] and its variants need more careful analysis and we believe our techniques would be instrumental for this as well.…”
Section: Discussion (mentioning)
confidence: 99%
“…The convergence of the two-time-scale SA is traditionally established through the analysis of an associated ordinary differential equation (ODE) [2]. Finite-time convergence of the two-time-scale SA has been studied for both linear [4,5,7,11,12,14,16,19] (when F and G are linear) and nonlinear settings [6,8,24], under both i.i.d. and Markovian samples.…”
Section: Related Work (mentioning)
confidence: 99%