Policy Gradient for Continuing Tasks in Non-stationary Markov Decision Processes
2020, Preprint
DOI: 10.48550/arxiv.2010.08443

Abstract: Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities. In this paper we consider the problem of finding optimal policies assuming that they belong to a reproducing kernel Hilbert space (RKHS). To that end, we compute unbiased stochastic gradients of the value function, which we use as ascent directions to update the policy. A major drawback of policy gradient-type algorithms is that they are …
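The abstract describes stochastic gradient ascent on the value function using unbiased gradient estimates. The sketch below illustrates that idea with a standard REINFORCE-style estimator; it is an illustration under stated assumptions, not the paper's method. In particular, the RKHS policy parametrization is replaced by a simple softmax-linear policy, and the environment object (with reset()/step() in the classic Gym convention), feature dimensions, and step size are all hypothetical.

import numpy as np

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def episode_gradient(env, theta, gamma=0.99, horizon=200):
    """Stochastic estimate of the discounted value-function gradient.

    theta: (n_features, n_actions) weights of a softmax-linear policy
    (stand-in for the paper's RKHS policy). The estimate is unbiased up
    to truncation at `horizon`.
    """
    grads, rewards = [], []
    s = env.reset()  # assumes classic Gym API: reset() -> observation
    for t in range(horizon):
        probs = softmax(s @ theta)                    # pi(. | s)
        a = np.random.choice(len(probs), p=probs)
        # Score function grad_theta log pi(a|s) for softmax-linear:
        # outer(s, e_a - probs).
        score = np.outer(s, -probs)
        score[:, a] += s
        grads.append(score)
        s, r, done, _ = env.step(a)                   # classic 4-tuple step
        rewards.append(r)
        if done:
            break
    # Discounted return-to-go G_t for each step.
    G, g = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        G[t] = g
    # Policy-gradient estimate: sum_t gamma^t * G_t * grad log pi_t.
    return sum((gamma ** t) * G[t] * grads[t] for t in range(len(rewards)))

# Ascent update on the policy weights (step size is illustrative):
# theta += 1e-2 * episode_gradient(env, theta)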

Cited by 1 publication (1 citation statement)
References 17 publications
“…While Theorem 1 provides an explicit bound on the admissible difference between occupation measures starting from different states, $\Delta^{s_0,s_k}_{\mathrm{TV}}$, it does not indicate when these occupation measures are similar. Using a different argument, [28] shows that the inner products in (13) can be made positive by restricting the policy parametrization. In contrast, using the occupation measure argument from Theorem 1, we provide conditions on the underlying dynamical system and control problem (discount factor γ) that are independent of the parametrization used for the policy.…”
mentioning
confidence: 99%
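For readers unfamiliar with the quantity $\Delta^{s_0,s_k}_{\mathrm{TV}}$ in the quoted statement, a plausible reading under standard definitions (an assumption on our part, since the citing paper's exact notation is not reproduced here) is the total-variation distance between discounted occupation measures induced by the same policy from two different start states:

\[
\rho_{s_0}(A) \;=\; (1-\gamma)\sum_{t=0}^{\infty} \gamma^{t}\,
\Pr\!\left(s_t \in A \,\middle|\, s_0\right),
\qquad
\Delta^{s_0,s_k}_{\mathrm{TV}}
\;=\; \big\|\rho_{s_0}-\rho_{s_k}\big\|_{\mathrm{TV}}
\;=\; \sup_{A}\,\big|\rho_{s_0}(A)-\rho_{s_k}(A)\big|.
\]

Under this reading, the quoted contrast is that the citing paper bounds this distance via properties of the dynamics and the discount factor γ, whereas [28] obtains the analogous positivity of the inner products by restricting the policy parametrization instead.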