2008
DOI: 10.1287/moor.1080.0324

A Learning Algorithm for Risk-Sensitive Cost

Abstract: A linear function approximation based reinforcement learning algorithm is proposed for Markov decision processes with infinite horizon risk-sensitive cost. Its convergence is proved using the 'o.d.e. method' for stochastic approximation. The scheme is also extended to continuous state space processes. 1. Introduction. Recent decades have seen a major activity in approximate dynamic programming for Markov decision processes based on real or simulated data, using reinforcement learning algorithms. (See, e.g., Be…
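For context, the risk-sensitive cost referred to in the abstract is, in this line of work, typically the exponential-utility average cost. A standard formulation is sketched below; the notation is illustrative and not necessarily the paper's own.

```latex
% Risk-sensitive (exponential utility) infinite-horizon average cost for a
% controlled Markov chain {X_t} with actions {A_t}, one-stage cost c(x,a),
% and stationary policy \pi. Standard formulation; notation is illustrative.
\[
  J(\pi) \;=\; \limsup_{n \to \infty} \, \frac{1}{n}
  \log \, \mathbb{E}^{\pi}\!\left[ \exp\!\left( \sum_{t=0}^{n-1} c(X_t, A_t) \right) \right].
\]
```

The 1/n normalization and the logarithm make this an average-cost criterion, while the exponential inside the expectation penalizes variability of the accumulated cost.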

Cited by 48 publications (70 citation statements)
References 28 publications
“…and we know that it is much easier to design actor-critic or other reinforcement learning algorithms (Borkar 2001, 2002; Basu et al. 2008; Borkar 2010) for this risk measure than those that will be presented in this paper. However, this formulation is limited in the sense that it requires knowing the ideal tradeoff between the mean and variance, since it takes β as an input.…”
Section: Simulation Experiments (mentioning; confidence: 99%)
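The fixed mean-variance tradeoff the excerpt refers to can be read off the standard small-β expansion of the exponential-utility risk measure; the identity below is illustrative and not a quotation from either paper, with R denoting the random cumulative reward or cost.

```latex
% Second-order Taylor expansion of the exponential utility in beta
% (valid for small |beta|); R is the random cumulative reward or cost.
\[
  \frac{1}{\beta} \log \mathbb{E}\!\left[ e^{\beta R} \right]
  \;\approx\; \mathbb{E}[R] \;+\; \frac{\beta}{2}\,\mathrm{Var}(R).
\]
```

Choosing β therefore amounts to fixing, in advance, how much variance is traded against the mean.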
“…Most of the work on this topic (including those mentioned above) has been in the context of MDPs (when the model of the system is known) and much less work has been done within the reinforcement learning (RL) framework (when the model is unknown and all the information about the system is obtained from the samples resulting from the agent's interaction with the environment). In risk-sensitive RL, we can mention the work by Borkar (2001, 2002, 2010) and Basu et al. (2008), who considered the expected exponential utility, the one by Mihatsch and Neuneier (2002) that formulated a new risk-sensitive control framework based on transforming the temporal difference errors that occur during learning, and the one by Tamar et al. (2012) on several variance-related measures. Tamar et al. (2012) study stochastic shortest path problems, and in this context, propose a policy gradient algorithm [and in a more recent work (Tamar and Mannor 2013) an actor-critic algorithm] for maximizing several risk-sensitive criteria that involve both the expectation and variance of the return random variable (defined as the sum of the rewards that the agent obtains in an episode).…”
Section: Introduction (mentioning; confidence: 99%)
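To make the TD-error-transformation idea from the excerpt concrete, here is a minimal tabular sketch in the spirit of Mihatsch and Neuneier (2002); the function name, state encoding, hyperparameters, and toy usage are illustrative assumptions, not code from any of the cited papers.

```python
import numpy as np

def risk_sensitive_td_update(V, s, s_next, reward,
                             alpha=0.1, gamma=0.99, kappa=0.5):
    """One tabular TD(0) update with an asymmetrically scaled TD error,
    in the spirit of Mihatsch and Neuneier (2002).

    kappa in (-1, 1): kappa > 0 weights negative surprises more heavily
    (risk-averse), kappa < 0 weights positive surprises more (risk-seeking),
    and kappa = 0 recovers the ordinary risk-neutral TD(0) update.
    """
    delta = reward + gamma * V[s_next] - V[s]      # ordinary TD error
    scaled = (1.0 - kappa) * delta if delta > 0 else (1.0 + kappa) * delta
    V[s] += alpha * scaled                         # risk-sensitive update
    return V

# Toy usage on a 3-state chain with an invented transition and reward.
V = np.zeros(3)
V = risk_sensitive_td_update(V, s=0, s_next=1, reward=-1.0)
```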
“…The discrete-time partial observation problem was solved by Whittle in [33] (see also [34]). For the infinite-horizon criterion in a Markovian setting, the reader can consult [5], [9], [10]. An important relation with robust controllers was found in [14], [15], whereas the risk-sensitive maximum principle was studied in [26], [27], [17], [20].…”
Section: Introduction (mentioning; confidence: 99%)
“…Their optimality, however, is based on the expected discounted rewards. In this paper, we focus on the compound return¹. The aim of this research is to maximize the compound return by extending the RL framework.…”
Section: Introduction (mentioning; confidence: 99%)
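As a small numerical illustration of why the compound return differs from an ordinary arithmetic average of per-period returns, consider the sketch below; the return sequence is invented purely for illustration.

```python
import numpy as np

# Contrast the arithmetic average of per-period returns with the
# compound (geometric) return over the same sequence.
per_period = np.array([0.10, -0.05, 0.08, -0.02])   # e.g. +10%, -5%, +8%, -2%

arithmetic_mean = per_period.mean()
compound_total = np.prod(1.0 + per_period) - 1.0
geometric_mean = (1.0 + compound_total) ** (1.0 / len(per_period)) - 1.0

print(f"arithmetic mean per period: {arithmetic_mean:.4f}")
print(f"compound return overall:    {compound_total:.4f}")
print(f"geometric mean per period:  {geometric_mean:.4f}")
# The geometric mean never exceeds the arithmetic mean, and the gap widens
# as the per-period returns become more variable, so maximizing the
# compound return implicitly penalizes volatility.
```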
“…Average-reward RL [6,12,13,15] maximizes the arithmetic average reward in reward-based MDPs. Risk-sensitive RL [1,2,5,7,9,11] not only maximizes the sum of expected discounted rewards but also minimizes the risk defined by each study. While these methods can learn risk-averse behavior, they do not take into account maximizing the compound return.…”
Section: Introduction (mentioning; confidence: 99%)