2008
DOI: 10.1287/moor.1080.0324

A Learning Algorithm for Risk-Sensitive Cost

Abstract: A linear function approximation based reinforcement learning algorithm is proposed for Markov decision processes with infinite horizon risk-sensitive cost. Its convergence is proved using the 'o.d.e. method' for stochastic approximation. The scheme is also extended to continuous state space processes. 1. Introduction. Recent decades have seen a major activity in approximate dynamic programming for Markov decision processes based on real or simulated data, using reinforcement learning algorithms. (See, e.g., Be…
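For context, the risk-sensitive cost referred to in the abstract is, in this line of work, typically the exponential-utility average cost. A standard formulation is sketched below; the notation is illustrative and not necessarily the paper's own.

```latex
% Risk-sensitive (exponential utility) infinite-horizon average cost for a
% controlled Markov chain {X_t} with actions {A_t}, one-stage cost c(x,a),
% and stationary policy \pi. Standard formulation; notation is illustrative.
\[
  J(\pi) \;=\; \limsup_{n \to \infty} \, \frac{1}{n}
  \log \, \mathbb{E}^{\pi}\!\left[ \exp\!\left( \sum_{t=0}^{n-1} c(X_t, A_t) \right) \right].
\]
```

The 1/n normalization and the logarithm make this an average-cost criterion, while the exponential inside the expectation penalizes variability of the accumulated cost.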

Cited by 48 publications (70 citation statements)
References 28 publications
“…and we know that it is much easier to design actor-critic or other reinforcement learning algorithms (Borkar 2001, 2002; Basu et al. 2008; Borkar 2010) for this risk measure than those that will be presented in this paper. However, this formulation is limited in the sense that it requires knowing the ideal tradeoff between the mean and variance, since it takes β as an input.…”
Section: Simulation Experiments (mentioning; confidence: 99%)
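The fixed mean-variance tradeoff the excerpt refers to can be read off the standard small-β expansion of the exponential-utility risk measure; the identity below is illustrative and not a quotation from either paper, with R denoting the random cumulative reward or cost.

```latex
% Second-order Taylor expansion of the exponential utility in beta
% (valid for small |beta|); R is the random cumulative reward or cost.
\[
  \frac{1}{\beta} \log \mathbb{E}\!\left[ e^{\beta R} \right]
  \;\approx\; \mathbb{E}[R] \;+\; \frac{\beta}{2}\,\mathrm{Var}(R).
\]
```

Choosing β therefore amounts to fixing, in advance, how much variance is traded against the mean.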
“…Most of the work on this topic (including those mentioned above) has been in the context of MDPs (when the model of the system is known) and much less work has been done within the reinforcement learning (RL) framework (when the model is unknown and all the information about the system is obtained from the samples resulting from the agent's interaction with the environment). In risk-sensitive RL, we can mention the work by Borkar (2001, 2002, 2010) and Basu et al. (2008), who considered the expected exponential utility, the one by Mihatsch and Neuneier (2002) that formulated a new risk-sensitive control framework based on transforming the temporal difference errors that occur during learning, and the one by Tamar et al. (2012) on several variance-related measures. Tamar et al. (2012) study stochastic shortest path problems, and in this context, propose a policy gradient algorithm [and in a more recent work (Tamar and Mannor 2013) an actor-critic algorithm] for maximizing several risk-sensitive criteria that involve both the expectation and variance of the return random variable (defined as the sum of the rewards that the agent obtains in an episode).…”
Section: Introduction (mentioning; confidence: 99%)
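To make the TD-error-transformation idea from the excerpt concrete, here is a minimal tabular sketch in the spirit of Mihatsch and Neuneier (2002); the function name, state encoding, hyperparameters, and toy usage are illustrative assumptions, not code from any of the cited papers.

```python
import numpy as np

def risk_sensitive_td_update(V, s, s_next, reward,
                             alpha=0.1, gamma=0.99, kappa=0.5):
    """One tabular TD(0) update with an asymmetrically scaled TD error,
    in the spirit of Mihatsch and Neuneier (2002).

    kappa in (-1, 1): kappa > 0 weights negative surprises more heavily
    (risk-averse), kappa < 0 weights positive surprises more (risk-seeking),
    and kappa = 0 recovers the ordinary risk-neutral TD(0) update.
    """
    delta = reward + gamma * V[s_next] - V[s]      # ordinary TD error
    scaled = (1.0 - kappa) * delta if delta > 0 else (1.0 + kappa) * delta
    V[s] += alpha * scaled                         # risk-sensitive update
    return V

# Toy usage on a 3-state chain with an invented transition and reward.
V = np.zeros(3)
V = risk_sensitive_td_update(V, s=0, s_next=1, reward=-1.0)
```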
“…The discrete-time partial observation problem was solved by Whittle in [33] (see also [34]). For the infinite-horizon criterion in a Markovian setting, the reader can consult [5], [9], [10]. An important relation with robust controllers was found in [14], [15], whereas the risk-sensitive maximum principle was studied in [26], [27], [17], [20].…”
Section: Introduction (mentioning; confidence: 99%)
“…Their optimality, however, is based on the expected discounted rewards. In this paper, we focus on the compound return¹. The aim of this research is to maximize the compound return by extending the RL framework.…”
Section: Introduction (mentioning; confidence: 99%)
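As a small numerical illustration of why the compound return differs from an ordinary arithmetic average of per-period returns, consider the sketch below; the return sequence is invented purely for illustration.

```python
import numpy as np

# Contrast the arithmetic average of per-period returns with the
# compound (geometric) return over the same sequence.
per_period = np.array([0.10, -0.05, 0.08, -0.02])   # e.g. +10%, -5%, +8%, -2%

arithmetic_mean = per_period.mean()
compound_total = np.prod(1.0 + per_period) - 1.0
geometric_mean = (1.0 + compound_total) ** (1.0 / len(per_period)) - 1.0

print(f"arithmetic mean per period: {arithmetic_mean:.4f}")
print(f"compound return overall:    {compound_total:.4f}")
print(f"geometric mean per period:  {geometric_mean:.4f}")
# The geometric mean never exceeds the arithmetic mean, and the gap widens
# as the per-period returns become more variable, so maximizing the
# compound return implicitly penalizes volatility.
```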
“…Average-reward RL [6,12,13,15] maximizes the arithmetic average reward in reward-based MDPs. Risk-sensitive RL [1,2,5,7,9,11] not only maximizes the sum of expected discounted rewards but also minimizes the risk defined by each study. While these methods can learn risk-averse behavior, they do not take into account maximizing the compound return.…”
Section: Introduction (mentioning; confidence: 99%)