2017
DOI: 10.1016/j.neucom.2016.10.100

A temporal difference method for multi-objective reinforcement learning

Abstract: This work describes MPQ-learning, a temporal-difference method that approximates the set of all non-dominated policies in multi-objective Markov decision problems, where rewards are vectors and each component stands for an objective to maximize. Unlike other approximations to Multi-objective Reinforcement Learning, MPQ-learning does not require additional parameters or preference information, and can be applied to non-convex Pareto frontiers. We also present the results of the application of MPQ-learning to s…
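The abstract's central object is the set of non-dominated (Pareto-optimal) value vectors. As a minimal illustration of that notion only, not of the paper's algorithm, the following Python sketch filters a collection of reward vectors down to its Pareto front, assuming every component is to be maximized:

```python
import numpy as np

def dominates(u, v):
    """True if u Pareto-dominates v: at least as good in every
    objective and strictly better in at least one."""
    u, v = np.asarray(u), np.asarray(v)
    return np.all(u >= v) and np.any(u > v)

def pareto_front(vectors):
    """Keep only the non-dominated vectors (all objectives maximized)."""
    return [u for u in vectors
            if not any(dominates(v, u) for v in vectors if v is not u)]

# With two objectives, (2, 1) and (1, 2) are incomparable, so both stay
# on the front, while (0, 0) is dominated by both and is filtered out.
print(pareto_front([(2, 1), (1, 2), (0, 0)]))  # [(2, 1), (1, 2)]
```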

Cited by 27 publications (26 citation statements)
References 15 publications
“…A limit of 700 actions per episode was set. We used the multiobjective ε-greedy behavior policy described by [3]. This basically calculates the ratio of nondominated vectors that each Q(s, a) contributes to V(s).…”
Section: Results (mentioning, confidence: 99%)
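The selection rule quoted above can be made concrete. The sketch below is a hedged reading of that one-sentence description, not the cited authors' code: `q_sets` (a hypothetical mapping from each action to its set of vector estimates for Q(s, a)) is an illustrative name, and V(s) is taken to be the non-dominated union of the Q(s, a) sets.

```python
import random

def pareto_front(vectors):
    """Non-dominated subset of a list of tuples (all objectives maximized)."""
    def dom(u, v):  # True if u Pareto-dominates v
        return (all(a >= b for a, b in zip(u, v))
                and any(a > b for a, b in zip(u, v)))
    return [u for u in vectors if not any(dom(v, u) for v in vectors if v != u)]

def mo_epsilon_greedy(q_sets, epsilon=0.1):
    """Multi-objective epsilon-greedy sketch: with probability epsilon pick
    a uniformly random action; otherwise weight each action a by the
    fraction of the non-dominated vectors in V(s) that Q(s, a) contributes."""
    actions = list(q_sets)
    if random.random() < epsilon:
        return random.choice(actions)
    # V(s): Pareto front of the union of all Q(s, a) vector sets.
    v_s = pareto_front([v for vs in q_sets.values() for v in vs])
    if not v_s:
        return random.choice(actions)
    # Ratio of front vectors contributed by each action.
    scores = [sum(v in v_s for v in q_sets[a]) / len(v_s) for a in actions]
    if sum(scores) == 0:
        return random.choice(actions)
    # Sample an action in proportion to its contribution ratio.
    return random.choices(actions, weights=scores, k=1)[0]

# Hypothetical example: each action contributes one of the two front vectors,
# so both are selected with equal probability when epsilon is zero.
q_sets = {"left": [(2, 1)], "right": [(1, 2), (0, 0)]}
print(mo_epsilon_greedy(q_sets, epsilon=0.0))  # "left" or "right"
```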
“…MPQ-learning [3] is an extension of Q-learning to multiobjective problems. The goal is to obtain the set of all Pareto-optimal deterministic policies.…”
Section: MPQ-Learning, 2.1 The Algorithm (mentioning, confidence: 99%)
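For orientation, here is a simplified tabular sketch of the set-based backup that this family of methods builds on: each Q(s, a) holds a set of non-dominated value vectors rather than a scalar. It is an assumption-laden illustration, not MPQ-learning itself, which additionally tracks which updated vectors correspond to which policies and blends old and new estimates with a learning rate.

```python
import numpy as np

GAMMA = 0.9  # discount factor (illustrative)

def nd_filter(vectors):
    """Non-dominated subset of a list of NumPy reward vectors (maximization)."""
    return [u for i, u in enumerate(vectors)
            if not any(np.all(v >= u) and np.any(v > u)
                       for j, v in enumerate(vectors) if j != i)]

def backup(Q, s, a, r, s_next):
    """One set-based backup: propagate every non-dominated continuation
    vector at s_next through the immediate reward r. A real TD method
    would also mix old and new estimates with a learning rate."""
    succ = [v for a2 in Q[s_next] for v in Q[s_next][a2]]
    v_next = nd_filter(succ) or [np.zeros_like(r)]
    Q[s][a] = nd_filter([r + GAMMA * v for v in v_next])

# Hypothetical two-objective example: Q maps state -> action -> vector set.
Q = {0: {"a": []},
     1: {"a": [np.array([1.0, 0.0])], "b": [np.array([0.0, 1.0])]}}
backup(Q, 0, "a", np.array([0.5, 0.5]), 1)
print(Q[0]["a"])  # two incomparable vectors survive the filter
```

Because the two continuation vectors at state 1 are incomparable, the backup keeps both at Q(0, a); this is how the method can track a non-convex Pareto frontier without scalarizing the objectives.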