2019
DOI: 10.48550/arxiv.1911.05697
Preprint

A Convergent Off-Policy Temporal Difference Algorithm

Raghuram Bharadwaj Diddigi, Chandramouli Kamanchi, Shalabh Bhatnagar

Abstract: Learning the value function of a given policy (target policy) from the data samples obtained from a different policy (behavior policy) is an important problem in Reinforcement Learning (RL). This problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning algorithms are a popular class of algorithms for solving the prediction problem. TD algorithms with linear function approximation are shown to be convergent when the samples are generated from the target policy (known as on…
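The abstract describes the off-policy prediction setting: estimating the value function of a target policy from transitions generated by a behavior policy, with a linear approximation V(s) ≈ θᵀφ(s). The sketch below only illustrates that setting; it is not the algorithm proposed in the paper. The importance-sampling correction, the regularization weight eta, and all names are assumptions added for exposition. With eta = 0 it reduces to plain off-policy TD(0) with linear function approximation, which is known not to converge in general and is the problem the paper addresses.

import numpy as np

def off_policy_td0(features, rewards, next_features, rhos,
                   alpha=0.01, gamma=0.95, eta=0.0):
    """One pass of off-policy TD(0) with linear function approximation.

    features      : (T, d) array, phi(s_t) observed under the behavior policy
    next_features : (T, d) array, phi(s_{t+1})
    rewards       : (T,)   array of observed rewards
    rhos          : (T,)   importance-sampling ratios pi(a_t|s_t) / mu(a_t|s_t)
    eta           : weight of an (assumed) L2 regularization term; eta = 0
                    gives plain off-policy TD(0), which need not converge.
    """
    d = features.shape[1]
    theta = np.zeros(d)
    for phi, r, phi_next, rho in zip(features, rewards, next_features, rhos):
        delta = r + gamma * phi_next @ theta - phi @ theta   # TD error
        theta += alpha * (rho * delta * phi - eta * theta)   # regularized update
    return theta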

Citations: Cited by 1 publication (4 citation statements)
References: 10 publications
“…where Θ is the set of all deterministic policies, and ⊗ is the Kronecker product. Theorem A.2, given in Appendix A.2 of the supplemental material, is similar to Theorem 2 in Diddigi et al, 2019, and ensures such a property. Now, we can use Theorem 2.8 to establish stability of the overall system.…”
Section: Algorithm
confidence: 96%
“…To address this problem, Diddigi et al, 2019 adds a regularization term to TD-learning in order to make it convergent. Since Q-learning can be interpreted as an off-policy TD-learning, we add a regularization term to Q-learning update motivated by Diddigi et al, 2019. This modification leads to the proposed algorithm as follows:…”
Section: Algorithm
confidence: 99%
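The statement above says the citing work adds a regularization term to the Q-learning update, motivated by the regularized TD-learning of Diddigi et al., 2019. A minimal tabular sketch of that idea follows; the specific regularizer (an L2-style shrinkage with weight eta) and all names are assumptions for illustration, since the quoted statement does not spell out the exact update.

import numpy as np

def regularized_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95, eta=0.01):
    """One tabular Q-learning step with an added regularization term.

    The regularizer (pulling Q[s, a] toward zero with weight eta) is an
    assumed placeholder, not the exact form used in the citing work.
    """
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * (td_error - eta * Q[s, a])
    return Q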