2017
DOI: 10.1016/j.neucom.2016.10.100

A temporal difference method for multi-objective reinforcement learning

Abstract: This work describes MPQ-learning, a temporal-difference method that approximates the set of all non-dominated policies in multi-objective Markov decision problems, where rewards are vectors and each component stands for an objective to maximize. Unlike other approximations to Multi-objective Reinforcement Learning, MPQ-learning does not require additional parameters or preference information, and can be applied to non-convex Pareto frontiers. We also present the results of the application of MPQ-learning to s…
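The abstract's central object is the set of non-dominated (Pareto-optimal) value vectors. As a minimal illustration of that notion only, not of the paper's algorithm, the following Python sketch filters a collection of reward vectors down to its Pareto front, assuming every component is to be maximized:

```python
import numpy as np

def dominates(u, v):
    """True if u Pareto-dominates v: at least as good in every
    objective and strictly better in at least one."""
    u, v = np.asarray(u), np.asarray(v)
    return np.all(u >= v) and np.any(u > v)

def pareto_front(vectors):
    """Keep only the non-dominated vectors (all objectives maximized)."""
    return [u for u in vectors
            if not any(dominates(v, u) for v in vectors if v is not u)]

# With two objectives, (2, 1) and (1, 2) are incomparable, so both stay
# on the front, while (0, 0) is dominated by both and is filtered out.
print(pareto_front([(2, 1), (1, 2), (0, 0)]))  # [(2, 1), (1, 2)]
```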

Cited by 27 publications (26 citation statements)
References 15 publications
“…A limit of 700 actions per episode was set. We used the multiobjective ε-greedy behavior policy described by [3]. This basically calculates the ratio of nondominated vectors that each Q(s, a) contributes to V(s).…”
Section: Results (mentioning, confidence: 99%)
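The selection rule quoted above can be made concrete. The sketch below is a hedged reading of that one-sentence description, not the cited authors' code: `q_sets` (a hypothetical mapping from each action to its set of vector estimates for Q(s, a)) is an illustrative name, and V(s) is taken to be the non-dominated union of the Q(s, a) sets.

```python
import random

def pareto_front(vectors):
    """Non-dominated subset of a list of tuples (all objectives maximized)."""
    def dom(u, v):  # True if u Pareto-dominates v
        return (all(a >= b for a, b in zip(u, v))
                and any(a > b for a, b in zip(u, v)))
    return [u for u in vectors if not any(dom(v, u) for v in vectors if v != u)]

def mo_epsilon_greedy(q_sets, epsilon=0.1):
    """Multi-objective epsilon-greedy sketch: with probability epsilon pick
    a uniformly random action; otherwise weight each action a by the
    fraction of the non-dominated vectors in V(s) that Q(s, a) contributes."""
    actions = list(q_sets)
    if random.random() < epsilon:
        return random.choice(actions)
    # V(s): Pareto front of the union of all Q(s, a) vector sets.
    v_s = pareto_front([v for vs in q_sets.values() for v in vs])
    if not v_s:
        return random.choice(actions)
    # Ratio of front vectors contributed by each action.
    scores = [sum(v in v_s for v in q_sets[a]) / len(v_s) for a in actions]
    if sum(scores) == 0:
        return random.choice(actions)
    # Sample an action in proportion to its contribution ratio.
    return random.choices(actions, weights=scores, k=1)[0]

# Hypothetical example: each action contributes one of the two front vectors,
# so both are selected with equal probability when epsilon is zero.
q_sets = {"left": [(2, 1)], "right": [(1, 2), (0, 0)]}
print(mo_epsilon_greedy(q_sets, epsilon=0.0))  # "left" or "right"
```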
“…MPQ-learning [3] is an extension of Q-learning to multiobjective problems. The goal is to obtain the set of all Pareto-optimal deterministic policies.…”
Section: MPQ-Learning, 2.1 The Algorithm (mentioning, confidence: 99%)
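For orientation, here is a simplified tabular sketch of the set-based backup that this family of methods builds on: each Q(s, a) holds a set of non-dominated value vectors rather than a scalar. It is an assumption-laden illustration, not MPQ-learning itself, which additionally tracks which updated vectors correspond to which policies and blends old and new estimates with a learning rate.

```python
import numpy as np

GAMMA = 0.9  # discount factor (illustrative)

def nd_filter(vectors):
    """Non-dominated subset of a list of NumPy reward vectors (maximization)."""
    return [u for i, u in enumerate(vectors)
            if not any(np.all(v >= u) and np.any(v > u)
                       for j, v in enumerate(vectors) if j != i)]

def backup(Q, s, a, r, s_next):
    """One set-based backup: propagate every non-dominated continuation
    vector at s_next through the immediate reward r. A real TD method
    would also mix old and new estimates with a learning rate."""
    succ = [v for a2 in Q[s_next] for v in Q[s_next][a2]]
    v_next = nd_filter(succ) or [np.zeros_like(r)]
    Q[s][a] = nd_filter([r + GAMMA * v for v in v_next])

# Hypothetical two-objective example: Q maps state -> action -> vector set.
Q = {0: {"a": []},
     1: {"a": [np.array([1.0, 0.0])], "b": [np.array([0.0, 1.0])]}}
backup(Q, 0, "a", np.array([0.5, 0.5]), 1)
print(Q[0]["a"])  # two incomparable vectors survive the filter
```

Because the two continuation vectors at state 1 are incomparable, the backup keeps both at Q(0, a); this is how the method can track a non-convex Pareto frontier without scalarizing the objectives.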