Stochastic Policy Gradient Ascent in Reproducing Kernel Hilbert Spaces
Preprint, 2018
DOI: 10.48550/arxiv.1807.11274

Abstract: Reinforcement learning consists of finding policies that maximize an expected cumulative long term reward in a Markov decision process with unknown transition probabilities and instantaneous rewards. In this paper we consider the problem of finding such optimal policies while assuming they are continuous functions belonging to a reproducing kernel Hilbert space (RKHS). To learn the optimal policy we introduce a stochastic policy gradient ascent algorithm with three unique novel features: (i) The stochastic est…
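
The abstract is cut off before it finishes listing the algorithm's three features, so the paper's actual method is not reproduced here. Purely as an illustration of the setting it describes, the sketch below shows one common way to run stochastic policy gradient ascent with a policy whose mean lies in an RKHS: a Gaussian policy around a kernel-expansion mean, updated with a REINFORCE-style functional gradient. The kernel choice, the policy class, and all names (`rbf`, `RKHSPolicy`, `q_hat`) are assumptions for illustration, not the paper's algorithm.

```python
# Illustrative sketch only (not the paper's algorithm): Gaussian policy with an
# RKHS-valued mean h(s) = sum_i w_i k(c_i, s), updated by stochastic functional
# gradient ascent on the expected return.
import numpy as np

def rbf(x, y, bw=1.0):
    """Gaussian (RBF) kernel between two state vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * bw ** 2)))

class RKHSPolicy:
    def __init__(self, sigma=0.5):
        self.centers, self.weights, self.sigma = [], [], sigma

    def mean(self, s):
        # h(s): kernel expansion accumulated by past gradient steps.
        return sum(w * rbf(c, s) for c, w in zip(self.centers, self.weights))

    def act(self, s, rng):
        # Sample a scalar action a ~ N(h(s), sigma^2).
        return self.mean(s) + self.sigma * rng.standard_normal()

    def gradient_step(self, s, a, q_hat, step_size):
        # The functional gradient of log N(a; h(s), sigma^2) with respect to h
        # is ((a - h(s)) / sigma^2) * k(s, .), so each stochastic ascent step
        # adds one kernel center at s; q_hat is any stochastic estimate of the
        # action value at (s, a).
        weight = step_size * q_hat * (a - self.mean(s)) / self.sigma ** 2
        self.centers.append(np.asarray(s, dtype=float))
        self.weights.append(weight)
```

Because every update adds a kernel center, practical variants of this idea typically prune or project the expansion to keep it finite; whether and how the paper handles this is in the truncated part of the abstract.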

Cited by 4 publications (22 citation statements). References 14 publications. Citing publications were published in 2019 (×2) and 2021 (×2).

“…In this remark we discuss the equivalence between the formulations in (1) and (2). This discussion is inspired by [5, Section 2.3] and by the proofs of [32, Propositions 2 and 3]. Let us start by considering the finite-horizon value function in (1) with a horizon chosen from a geometric distribution with parameter γ ∈ (0, 1).…”
Section: Problem Formulation (mentioning, confidence: 99%)
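
The equation numbers (1) and (2) refer to the citing paper and are not reproduced on this page. Assuming, as the quote suggests, that (1) is a finite-horizon value function whose horizon T is drawn geometrically and (2) is the γ-discounted infinite-horizon value, the equivalence being discussed is the standard identity below, written here with the parameterization Pr(T = t) = (1 − γ)γ^t for t = 0, 1, 2, …:

```latex
% Sketch of the equivalence, assuming Pr(T = t) = (1-\gamma)\gamma^{t} for t \ge 0,
% so that Pr(T \ge t) = \gamma^{t}.
\mathbb{E}_{T}\!\left[\sum_{t=0}^{T} r(s_t, a_t)\right]
  = \sum_{t=0}^{\infty} \mathbb{E}\!\left[r(s_t, a_t)\,\mathbf{1}\{t \le T\}\right]
  = \sum_{t=0}^{\infty} \Pr(T \ge t)\,\mathbb{E}\!\left[r(s_t, a_t)\right]
  = \sum_{t=0}^{\infty} \gamma^{t}\,\mathbb{E}\!\left[r(s_t, a_t)\right].
```

The first equality interchanges sum and expectation and the second factors the expectation; this requires bounded rewards and a horizon drawn independently of the trajectory, which is exactly what the next citation statement addresses.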
“…Under mild assumptions on the reward function it is possible to exchange the sum and the expectation (see, e.g., [32, Proposition 2]). Also assuming that the horizon is drawn independently of the trajectory, we can write…”
Section: Problem Formulation (mentioning, confidence: 99%)
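
On the "mild assumptions" invoked here: the cited result ([32, Proposition 2]) is not reproduced on this page, but the generic justification, assuming uniformly bounded rewards |r(s, a)| ≤ B_r, is absolute convergence of the discounted series, which licenses the interchange by dominated convergence/Fubini:

```latex
% Generic argument under the assumption of bounded rewards, |r(s,a)| \le B_r.
\sum_{t=0}^{\infty} \mathbb{E}\big[\,\lvert \gamma^{t} r(s_t, a_t) \rvert\,\big]
  \le \sum_{t=0}^{\infty} \gamma^{t} B_r
  = \frac{B_r}{1-\gamma} < \infty
\quad\Longrightarrow\quad
\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]
  = \sum_{t=0}^{\infty} \gamma^{t}\,\mathbb{E}\big[r(s_t, a_t)\big].
```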
“…Overall, the resulting procedure takes the form shown in Algorithm 1. At time k, the agent advances the system by T ∼ Geom(γ) steps and thus obtains a sample s_k from the occupancy measure µ [26]. Similarly, to obtain estimates of ∇V^{λ_k}_{s_k}(θ_k) and U_{s_k}(θ_k), the agent accumulates the rewards r(s_t, a_t) and the constraint-satisfaction indicators 1(s ∈ S_safe) over another rollout of T_Q ∼ Geom(γ) time steps.…”
(mentioning, confidence: 99%)
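
The two-phase rollout described in this quote can be sketched compactly. The snippet below is an illustration only; the environment interface (`reset`/`step`), the `policy` callable, and the omission of the constraint bookkeeping and dual-variable machinery are simplifications on my part, not the cited Algorithm 1. The first Geom(γ) rollout yields a state distributed according to the (normalized) discounted occupancy measure, and a second Geom(γ) rollout from that state gives an unbiased estimate of the discounted value there without any explicit γ^t weighting.

```python
# Illustrative sketch of the two-phase geometric rollout described above.
# Assumed placeholder interfaces: env.reset() -> state,
# env.step(a) -> (state, reward, done), and policy(state) -> action.
import numpy as np

def geometric_rollout(env, policy, gamma, rng=None):
    """Return (s_k, value_estimate): a draw from the discounted occupancy measure
    and an unbiased estimate of the discounted value at that draw."""
    rng = rng if rng is not None else np.random.default_rng()
    # Phase 1: advance T ~ Geom(gamma) steps, with Pr(T = t) = (1 - gamma) * gamma**t.
    # numpy's geometric sampler has support {1, 2, ...}, hence the "- 1".
    T = int(rng.geometric(1.0 - gamma)) - 1
    s = env.reset()
    for _ in range(T):
        s, _, _ = env.step(policy(s))
    s_k = s
    # Phase 2: continue for T_Q ~ Geom(gamma) further steps, summing raw rewards.
    # No gamma**t factors are needed: the discounting is absorbed by the random horizon.
    T_Q = int(rng.geometric(1.0 - gamma)) - 1
    value_estimate = 0.0
    for _ in range(T_Q + 1):
        s, r, _ = env.step(policy(s))
        value_estimate += r
    return s_k, value_estimate
```

The same construction works for the constraint term: summing the indicators 1(s_t ∈ S_safe) over the second rollout instead of the rewards estimates the discounted constraint-satisfaction value.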
“…Since both series are convergent (by hypothesis; otherwise ρ_z(s) would not be a proper probability measure), the summations in (26) can be rearranged to get…”
(mentioning, confidence: 99%)
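
The rearrangement invoked here is the standard fact about absolutely convergent double series (equation (26) of the citing paper is not shown on this page, so only the generic statement is given):

```latex
% Fubini/Tonelli for series: absolute convergence permits rearrangement.
\sum_{m=0}^{\infty}\sum_{n=0}^{\infty} \lvert a_{m,n} \rvert < \infty
\quad\Longrightarrow\quad
\sum_{m=0}^{\infty}\sum_{n=0}^{\infty} a_{m,n}
  = \sum_{n=0}^{\infty}\sum_{m=0}^{\infty} a_{m,n},
```

and, more generally, any rearrangement of the terms yields the same sum.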