2014
DOI: 10.1080/03081079.2014.883387
Variance-penalized Markov decision processes: dynamic programming and reinforcement learning techniques

Abstract: In control systems theory, the Markov decision process (MDP) is a widely used optimization model involving selection of the optimal action in each state visited by a discrete-event system driven by Markov chains. The classical MDP model is suitable for an agent/decision-maker interested in maximizing expected revenues, but does not account for minimizing variability in the revenues. An MDP model in which the agent can maximize the revenues while simultaneously controlling the variance in the revenues is proposed…
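For orientation, one common way to write the kind of variance-penalized score the abstract describes, for a stationary policy μ with long-run average reward ρ(μ), reward variance σ²(μ), and a risk-penalty weight θ > 0, is the following (an assumed illustrative form, not quoted from the paper):

```latex
% Illustrative variance-penalized score for a stationary policy \mu
% (assumed form; \theta > 0 weights the variance penalty):
\phi(\mu) \;=\; \rho(\mu) \;-\; \theta\,\sigma^{2}(\mu),
\qquad
\rho(\mu) \;=\; \lim_{k\to\infty}\tfrac{1}{k}\,E_{\mu}\!\Big[\textstyle\sum_{t=1}^{k} r_t\Big],
\qquad
\sigma^{2}(\mu) \;=\; \lim_{k\to\infty}\tfrac{1}{k}\,E_{\mu}\!\Big[\textstyle\sum_{t=1}^{k}\big(r_t-\rho(\mu)\big)^{2}\Big].
```

Maximizing φ trades expected revenue against revenue variability through the single weight θ.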

Cited by 17 publications (10 citation statements).
References 28 publications (20 reference statements).
“…These problems include, but are not limited to, the risk-averse variance and mean-variance optimizations in discounted and average MDPs, and these four metrics can be covered by (6). When the discount factor α ↑ 1, the problem turns into the mean-variance maximization in average MDPs (Xia 2020, Gosavi 2014) (see Remark 1). When the risk-aversion parameter β is large enough with respect to the mean, the problem reduces to the variance minimization problem (for average MDPs, see Xia 2016).…”
Section: Algorithm 3 Value Iteration Variants for Inner Optimization … (mentioning)
confidence: 99%
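For reference, the mean-variance objective the excerpt refers to is often written with the risk-aversion parameter β as below (an assumed textbook form, not reproduced from the cited works); as β grows relative to the mean term, maximizing the objective is dominated by minimizing the variance, which is the degenerate case mentioned above.

```latex
% Assumed illustrative form of the mean-variance objective with
% risk-aversion parameter \beta:
\max_{\pi}\; J_{\beta}(\pi) \;=\; E_{\pi}[R] \;-\; \beta\,\mathrm{Var}_{\pi}[R],
\qquad R = \text{(discounted or average) return under } \pi .
```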
“…2. For the unified set of problems, a set of algorithms can be developed and analyzed in the framework, such as policy gradients (Prashanth and Ghavamzadeh 2013, Bisi et al. 2020), policy iterations (Xia 2016, 2020, Zhang et al. 2021), and value iteration (Gosavi 2014). Convergence analyses missing from some previous works can be developed as well.…”
Section: Algorithm 3 Value Iteration Variants for Inner Optimization … (mentioning)
confidence: 99%
“…Unlike DP, RL does not require an explicit model of the behavior of the system. This, however, does not imply that RL techniques cannot take advantage of a system behavior model [6], [1].…”
Section: Fuzzy Q-Learning: Theory, A. Markov Decision Process (mentioning)
confidence: 99%
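The model-free point in the excerpt can be illustrated with plain tabular Q-learning, which updates action values from sampled transitions without using transition probabilities. The sketch below is a generic illustration, not the fuzzy Q-learning variant of the citing paper; the environment interface env.reset()/env.step() is an assumption.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning sketch; env is a hypothetical environment whose
    step(a) returns (next_state, reward, done) with integer states."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # model-free update: no explicit transition model is required
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```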
“…In motion planning, however, the path has features that influence the optimal path to the end-point pose. The following equations (1, 2) show the differences between finite- and infinite-horizon spaces.…”
Section: Fuzzy Q-Learning: Theory, A. Markov Decision Process (mentioning)
confidence: 99%
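For reference, the finite- and infinite-horizon objectives contrasted in the excerpt usually take the following forms (typical textbook forms; the citing paper's own equations (1) and (2) are not reproduced here):

```latex
% Typical finite-horizon (left) and discounted infinite-horizon (right)
% value functions for a policy \pi (assumed standard forms):
V_{T}^{\pi}(s) \;=\; E_{\pi}\!\Big[\textstyle\sum_{t=0}^{T-1} r(s_t, a_t)\,\Big|\, s_0 = s\Big],
\qquad
V^{\pi}(s) \;=\; E_{\pi}\!\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\,\Big|\, s_0 = s\Big].
```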