2014
DOI: 10.1080/03081079.2014.883387
Variance-penalized Markov decision processes: dynamic programming and reinforcement learning techniques

Abstract: In control systems theory, the Markov decision process (MDP) is a widely used optimization model involving selection of the optimal action in each state visited by a discrete-event system driven by Markov chains. The classical MDP model is suitable for an agent/decision-maker interested in maximizing expected revenues, but does not account for minimizing variability in the revenues. An MDP model in which the agent can maximize the revenues while simultaneously controlling the variance in the revenues is proposed…
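For orientation, one common way to write the kind of variance-penalized score the abstract describes, for a stationary policy μ with long-run average reward ρ(μ), reward variance σ²(μ), and a risk-penalty weight θ > 0, is the following (an assumed illustrative form, not quoted from the paper):

```latex
% Illustrative variance-penalized score for a stationary policy \mu
% (assumed form; \theta > 0 weights the variance penalty):
\phi(\mu) \;=\; \rho(\mu) \;-\; \theta\,\sigma^{2}(\mu),
\qquad
\rho(\mu) \;=\; \lim_{k\to\infty}\tfrac{1}{k}\,E_{\mu}\!\Big[\textstyle\sum_{t=1}^{k} r_t\Big],
\qquad
\sigma^{2}(\mu) \;=\; \lim_{k\to\infty}\tfrac{1}{k}\,E_{\mu}\!\Big[\textstyle\sum_{t=1}^{k}\big(r_t-\rho(\mu)\big)^{2}\Big].
```

Maximizing φ trades expected revenue against revenue variability through the single weight θ.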

Cited by 17 publications (10 citation statements).
References 28 publications (20 reference statements).
“…These problems include, but are not limited to, the risk-averse variance and mean-variance optimizations in discounted and average MDPs, and these four metrics can be covered by (6). When the discount factor α ↑ 1, the problem turns into the mean-variance maximization in average MDPs (Xia 2020, Gosavi 2014) (see Remark 1). When the risk-aversion parameter β is large enough with respect to the mean, the problem reduces to the variance minimization problem (for average MDPs, see Xia 2016).…”
Section: Algorithm 3 Value Iteration Variants for Inner Optimization … (mentioning)
confidence: 99%
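For reference, the mean-variance objective the excerpt refers to is often written with the risk-aversion parameter β as below (an assumed textbook form, not reproduced from the cited works); as β grows relative to the mean term, maximizing the objective is dominated by minimizing the variance, which is the degenerate case mentioned above.

```latex
% Assumed illustrative form of the mean-variance objective with
% risk-aversion parameter \beta:
\max_{\pi}\; J_{\beta}(\pi) \;=\; E_{\pi}[R] \;-\; \beta\,\mathrm{Var}_{\pi}[R],
\qquad R = \text{(discounted or average) return under } \pi .
```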
“…2. For the unified set of problems, a set of algorithms can be developed and analyzed in the framework, such as policy gradients (Prashanth and Ghavamzadeh 2013, Bisi et al. 2020), policy iterations (Xia 2016, 2020, Zhang et al. 2021), and value iteration (Gosavi 2014). Convergence analyses missing from some previous works can be developed as well.…”
Section: Algorithm 3 Value Iteration Variants for Inner Optimization … (mentioning)
confidence: 99%
“…Unlike DP, RL does not require an explicit model of the behavior of the system. This, however, does not imply that RL techniques cannot take advantage of a system behavior model [6], [1].…”
Section: Fuzzy Q-Learning: Theory, A. Markov Decision Process (mentioning)
confidence: 99%
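The model-free point in the excerpt can be illustrated with plain tabular Q-learning, which updates action values from sampled transitions without using transition probabilities. The sketch below is a generic illustration, not the fuzzy Q-learning variant of the citing paper; the environment interface env.reset()/env.step() is an assumption.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning sketch; env is a hypothetical environment whose
    step(a) returns (next_state, reward, done) with integer states."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # model-free update: no explicit transition model is required
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```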
“…In motion planning, however, the path has features that influence the optimal path to the end-point pose. The following equations (1, 2) show the differences between finite- and infinite-horizon spaces.…”
Section: Fuzzy Q-Learning: Theory, A. Markov Decision Process (mentioning)
confidence: 99%
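For reference, the finite- and infinite-horizon objectives contrasted in the excerpt usually take the following forms (typical textbook forms; the citing paper's own equations (1) and (2) are not reproduced here):

```latex
% Typical finite-horizon (left) and discounted infinite-horizon (right)
% value functions for a policy \pi (assumed standard forms):
V_{T}^{\pi}(s) \;=\; E_{\pi}\!\Big[\textstyle\sum_{t=0}^{T-1} r(s_t, a_t)\,\Big|\, s_0 = s\Big],
\qquad
V^{\pi}(s) \;=\; E_{\pi}\!\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\,\Big|\, s_0 = s\Big].
```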