2022
DOI: 10.48550/arxiv.2205.05800
Preprint

Stochastic first-order methods for average-reward Markov decision processes

Abstract: We study the problem of average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy evaluation and optimization. Existing on-policy evaluation methods suffer from sub-optimal convergence rates as well as failure in handling insufficiently random policies, e.g., deterministic policies, for lack of exploration. To remedy these issues, we develop a novel variance-reduced temporal difference (VRTD) method with linear function approximation…
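
The abstract describes a variance-reduced temporal difference (VRTD) method with linear function approximation for average-reward policy evaluation. As a rough point of reference only, the sketch below shows a generic SVRG-style variance-reduced TD update with linear features and a running average-reward estimate; the epoch structure, step sizes, feature map `phi`, and batch interface are illustrative assumptions and not the paper's exact VRTD algorithm.

```python
import numpy as np

def vrtd_sketch(transitions, phi, num_epochs=20, inner_steps=200,
                alpha=0.1, beta=0.1, seed=0):
    """SVRG-style variance-reduced TD sketch for average-reward evaluation.

    `transitions` is a list of (s, r, s_next) tuples collected under the
    policy being evaluated, and `phi(s)` returns a feature vector, so the
    differential value function is approximated as V(s) ~ phi(s) @ theta.
    This is a generic construction, not the paper's exact VRTD method.
    """
    rng = np.random.default_rng(seed)
    d = phi(transitions[0][0]).shape[0]
    theta = np.zeros(d)   # linear weights
    eta = 0.0             # running estimate of the average reward

    def semi_grad(th, et, s, r, s_next):
        # Negative TD(0) update direction evaluated at the point (th, et).
        delta = r - et + phi(s_next) @ th - phi(s) @ th
        return -delta * phi(s)

    for _ in range(num_epochs):
        # Anchor (reference) point and its full-batch semi-gradient.
        theta_tilde, eta_tilde = theta.copy(), eta
        g_bar = np.mean(
            [semi_grad(theta_tilde, eta_tilde, s, r, sn)
             for (s, r, sn) in transitions], axis=0)
        for _ in range(inner_steps):
            s, r, s_next = transitions[rng.integers(len(transitions))]
            # Variance-reduced direction: g_i(theta) - g_i(theta_tilde) + g_bar.
            g = (semi_grad(theta, eta, s, r, s_next)
                 - semi_grad(theta_tilde, eta_tilde, s, r, s_next) + g_bar)
            theta -= alpha * g
            eta += beta * (r - eta)   # track the average reward
    return theta, eta
```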

Cited by 2 publications (4 citation statements) · References 28 publications
“…As we have discussed in Section 1, there have been two main approaches in the literature on online PG methods for addressing (2.4). One approach forces (2.3) with myopic exploration, either through ε-exploration [4, 17] or policy perturbation [26]. A drawback of myopic exploration, aside from adding another layer of implementation complexity, is its repeated sampling of non-optimal or high-risk actions even after the policy has identified them.…”
Section: Stochastic Policy Mirror Descent
confidence: 99%
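
To make the exploration device in the quotation concrete, the snippet below sketches ε-exploration: the policy's action distribution is mixed with the uniform distribution so every action retains probability at least ε/|A|, even under a deterministic policy. The function name and mixing form are illustrative and not taken from the cited works.

```python
import numpy as np

def epsilon_explore(policy_probs, epsilon=0.1):
    """Mix a (possibly deterministic) action distribution with uniform noise.

    `policy_probs` is a length-|A| probability vector; the returned vector
    gives every action probability at least epsilon / |A|, which is the
    myopic-exploration device discussed in the quotation above.
    """
    policy_probs = np.asarray(policy_probs, dtype=float)
    uniform = np.ones_like(policy_probs) / policy_probs.size
    return (1.0 - epsilon) * policy_probs + epsilon * uniform

# Example: a deterministic policy over 4 actions still samples every action.
print(epsilon_explore([0.0, 1.0, 0.0, 0.0], epsilon=0.2))  # [0.05 0.85 0.05 0.05]
```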
“…It is worth noting that the two-step construction of VBE-I also makes it feasible to utilize other online policy evaluation operators for constructing V^{π,ξ} with linearly converging bias. This is particularly helpful as it allows the policy evaluation step to incorporate simple function approximation for problems with large state spaces (e.g., linear approximation [19, 26]). This also seems to be a unique feature of the VBE-I operator compared to the other policy evaluation operators studied in this manuscript.…”
Section: VBE-I
confidence: 99%
“…In this section, we study the problem of optimal control, which aims to find a policy that optimizes the robust average-reward: π* = arg max_π g^π_P. Similar to Assumption 1, we make the following assumption to guarantee that the average reward is independent of the initial state (Abounadi et al., 2001; Li et al., 2022). Assumption 4.…”
Section: Robust RVI Q-learning for Control
confidence: 99%
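
The quotation builds on relative value iteration (RVI) Q-learning for average-reward control (Abounadi et al., 2001). For context, here is a minimal tabular sketch of the plain, non-robust algorithm; the `env_step` interface, the ε-greedy behaviour policy, the step sizes, and the choice f(Q) = Q[ref] for the offset are illustrative assumptions, and the robust variant studied in the citing paper (which bootstraps through a worst case over an uncertainty set of transition kernels) is not implemented here.

```python
import numpy as np

def rvi_q_learning(env_step, num_states, num_actions, ref=(0, 0),
                   steps=100_000, alpha=0.05, epsilon=0.1, seed=0):
    """Tabular RVI Q-learning sketch for the average-reward criterion.

    `env_step(s, a)` returns (reward, next_state).  The reference entry
    Q[ref] plays the role of the offset f(Q) in relative value iteration,
    following Abounadi et al. (2001).  This is the plain (non-robust)
    version; a robust variant would replace the next-state bootstrap with a
    worst case over an uncertainty set of transition kernels.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    s = 0
    for _ in range(steps):
        # epsilon-greedy behaviour policy to keep every (s, a) pair visited
        a = rng.integers(num_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        r, s_next = env_step(s, a)
        target = r + np.max(Q[s_next]) - Q[ref]   # subtract the offset f(Q)
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
    gain = Q[ref]                    # estimate of the optimal average reward
    policy = np.argmax(Q, axis=1)    # greedy policy with respect to Q
    return Q, gain, policy
```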