2019
DOI: 10.48550/arxiv.1911.05697
Preprint

A Convergent Off-Policy Temporal Difference Algorithm

Raghuram Bharadwaj Diddigi, Chandramouli Kamanchi, Shalabh Bhatnagar

Abstract: Learning the value function of a given policy (target policy) from the data samples obtained from a different policy (behavior policy) is an important problem in Reinforcement Learning (RL). This problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning algorithms are a popular class of algorithms for solving the prediction problem. TD algorithms with linear function approximation are shown to be convergent when the samples are generated from the target policy (known as on…
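The abstract describes the off-policy prediction setting: estimating the value function of a target policy from transitions generated by a behavior policy, with a linear approximation V(s) ≈ θᵀφ(s). The sketch below only illustrates that setting; it is not the algorithm proposed in the paper. The importance-sampling correction, the regularization weight eta, and all names are assumptions added for exposition. With eta = 0 it reduces to plain off-policy TD(0) with linear function approximation, which is known not to converge in general and is the problem the paper addresses.

import numpy as np

def off_policy_td0(features, rewards, next_features, rhos,
                   alpha=0.01, gamma=0.95, eta=0.0):
    """One pass of off-policy TD(0) with linear function approximation.

    features      : (T, d) array, phi(s_t) observed under the behavior policy
    next_features : (T, d) array, phi(s_{t+1})
    rewards       : (T,)   array of observed rewards
    rhos          : (T,)   importance-sampling ratios pi(a_t|s_t) / mu(a_t|s_t)
    eta           : weight of an (assumed) L2 regularization term; eta = 0
                    gives plain off-policy TD(0), which need not converge.
    """
    d = features.shape[1]
    theta = np.zeros(d)
    for phi, r, phi_next, rho in zip(features, rewards, next_features, rhos):
        delta = r + gamma * phi_next @ theta - phi @ theta   # TD error
        theta += alpha * (rho * delta * phi - eta * theta)   # regularized update
    return theta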

Citations: Cited by 1 publication (4 citation statements)
References: 10 publications
“…where Θ is the set of all deterministic policies, and ⊗ is the Kronecker product. Theorem A.2, given in Appendix A.2 of the supplemental material, is similar to Theorem 2 in Diddigi et al, 2019, and ensures such a property. Now, we can use Theorem 2.8 to establish stability of the overall system.…”
Section: Algorithm
confidence: 96%
“…To address this problem, Diddigi et al, 2019 adds a regularization term to TD-learning in order to make it convergent. Since Q-learning can be interpreted as an off-policy TD-learning, we add a regularization term to Q-learning update motivated by Diddigi et al, 2019. This modification leads to the proposed algorithm as follows:…”
Section: Algorithm
confidence: 99%
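The statement above says the citing work adds a regularization term to the Q-learning update, motivated by the regularized TD-learning of Diddigi et al., 2019. A minimal tabular sketch of that idea follows; the specific regularizer (an L2-style shrinkage with weight eta) and all names are assumptions for illustration, since the quoted statement does not spell out the exact update.

import numpy as np

def regularized_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95, eta=0.01):
    """One tabular Q-learning step with an added regularization term.

    The regularizer (pulling Q[s, a] toward zero with weight eta) is an
    assumed placeholder, not the exact form used in the citing work.
    """
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * (td_error - eta * Q[s, a])
    return Q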