2021
DOI: 10.1007/s00521-021-05859-1

The impact of environmental stochasticity on value-based multiobjective reinforcement learning

Abstract: A common approach to address multiobjective problems using reinforcement learning methods is to extend model-free, value-based algorithms such as Q-learning to use a vector of Q-values in combination with an appropriate action selection mechanism that is often based on scalarisation. Most prior empirical evaluation of these approaches has focused on deterministic environments. This study examines the impact of stochasticity in rewards and state transitions on the behaviour of multi-objective Q-learning. It sho…
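To make the approach described in the abstract concrete, the sketch below shows vector-valued Q-learning with a linearly scalarised, epsilon-greedy action selection. It is a minimal illustration under assumed names and hyperparameters (n_states, n_actions, n_objectives, weights, alpha, gamma, epsilon), not the exact algorithm or settings evaluated in the paper.

```python
import numpy as np

# Minimal sketch of multi-objective Q-learning with vector-valued Q-values and
# linear scalarisation for action selection. All names and values here are
# illustrative assumptions, not taken from the paper.
n_states, n_actions, n_objectives = 10, 4, 2
Q = np.zeros((n_states, n_actions, n_objectives))   # one Q-value per objective
weights = np.array([0.7, 0.3])                      # scalarisation weights
alpha, gamma, epsilon = 0.1, 0.95, 0.1

def select_action(state, rng):
    """Epsilon-greedy over the scalarised Q-vectors."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    scalarised = Q[state] @ weights                  # shape: (n_actions,)
    return int(np.argmax(scalarised))

def update(state, action, reward_vec, next_state):
    """Standard Q-learning update applied component-wise to the reward vector."""
    greedy_next = int(np.argmax(Q[next_state] @ weights))
    td_target = reward_vec + gamma * Q[next_state, greedy_next]
    Q[state, action] += alpha * (td_target - Q[state, action])

# usage: rng = np.random.default_rng(0); a = select_action(0, rng)
```

Because scalarisation is applied only at action-selection time, the stored Q-values remain vectors; it is the interaction of this style of algorithm with stochastic rewards and transitions that the paper examines.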

Cited by 15 publications (12 citation statements) · References 30 publications

Citation statements:
“…In order to apply RL to the real world the MORL community must consider the ESR criterion. However, the ESR criterion has largely been ignored by the MORL community, with the exception of the works of Roijers et al [31,30], Hayes et al [17,16] and Vamplew et al [39]. The works of Hayes et al [16,17] and Roijers et al [30] present single-policy algorithms that are suitable to learn policies under the ESR criterion, however, prior to this work, a formal definition of the necessary requirements to satisfy the ESR criterion had not previously been defined.…”
Section: Discussion (mentioning, confidence: 99%)
“…The current MORL literature focuses only on methods which learn the optimal set of policies under the SER criterion [23,41]. As already highlighted, the ESR criterion has largely been ignored by the MORL community, with a few exceptions [30,17,39]. In Section 6 we address this research gap and we present a novel multi-objective tabular distributional reinforcement learning (MOTDRL) algorithm that learns an optimal set of policies for the ESR criterion, also known as the ESR set, for multi-objective multi-armed bandit (MOMAB) problems.…”
Section: Multi-objective Tabular Distributional Reinforcement Learning (mentioning, confidence: 99%)
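As an aside on how a distributional, ESR-oriented method differs from expected-value methods in the bandit setting discussed above: the sketch below is not the MOTDRL algorithm from the cited work, only an assumed illustration. It keeps the observed vector returns per arm of a multi-objective bandit and ranks arms by a Monte-Carlo estimate of E[u(R)], the quantity the ESR criterion optimises, rather than by u(E[R]). The utility function and arm count are assumptions for the example.

```python
import numpy as np

def utility(r):
    # An illustrative nonlinear utility over a 2-objective return vector.
    return r[0] * r[1]

n_arms = 3
observed = [[] for _ in range(n_arms)]   # empirical return samples per arm

def record(arm, reward_vec):
    """Store one observed vector return for an arm."""
    observed[arm].append(np.asarray(reward_vec, dtype=float))

def esr_value(arm):
    """Monte-Carlo estimate of E[u(R)] for one arm."""
    samples = observed[arm]
    return np.mean([utility(r) for r in samples]) if samples else -np.inf

def best_arm_esr():
    """Pick the arm with the highest estimated expected utility of its return."""
    return int(np.argmax([esr_value(a) for a in range(n_arms)]))
```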
“…In the seventh paper, The impact of environmental stochasticity on value-based multiobjective reinforcement learning [7], Vamplew et al analyse the role of stochastic rewards and stochastic state transitions in multi-objective Q-Learning. Most of the previous empirical evaluations of multi-objective reinforcement learning and scalarisation methods assume that environments are deterministic.…”
Section: Contents of the Special Issue (mentioning, confidence: 99%)
“…Most of the previous empirical evaluations of multi-objective reinforcement learning and scalarisation methods assume that environments are deterministic. In [7], the authors find that stochasticity in rewards/transitions affects the optimal solution that agents can learn and, importantly, introduces important differences based on the choice of optimisation criterion (e.g. expected scalarised returns or scalarised expected returns).…”
Section: Contents of the Special Issue (mentioning, confidence: 99%)
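The distinction named in this passage between the two optimisation criteria can be written out explicitly. These are the standard formulations used in the MORL literature; the notation here (utility/scalarisation function f, vector reward r_t, discount gamma, policy pi) is assumed for illustration:

```latex
% Scalarised Expected Returns (SER): scalarise the expected return vector
\max_{\pi} \; f\!\left( \mathbb{E}\!\left[ \sum_{t} \gamma^{t} \mathbf{r}_t \,\middle|\, \pi \right] \right)

% Expected Scalarised Returns (ESR): take the expectation of the scalarised return
\max_{\pi} \; \mathbb{E}\!\left[ f\!\left( \sum_{t} \gamma^{t} \mathbf{r}_t \right) \,\middle|\, \pi \right]
```

For a linear f the two criteria coincide; under a nonlinear f combined with stochastic rewards or transitions they can prefer different policies, which is the difference the quoted passage refers to.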
“…Nevertheless, this violates the assumption of additive returns in the Bellman equation at the heart of these algorithms [Roijers et al, 2013], and therefore it may be necessary to condition the Q-values and the agent's choice of action on an augmented state formed by concatenating the environmental state with the summed rewards previously received by the agent [Geibel, 2006]. Additionally these approaches may fail to converge to the optimal policy in environments with stochastic state transitions [Vamplew et al, 2021a].…”
Section: Single-policy Algorithms (mentioning, confidence: 99%)
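As an illustration of the augmented-state idea mentioned in the quote (conditioning Q-values on the environmental state concatenated with the rewards accumulated so far), here is a minimal tabular sketch. The discretisation of accumulated rewards, the scalarisation weights, and all names are assumptions for the example, not details from the cited works.

```python
from collections import defaultdict
import numpy as np

n_actions, n_objectives = 4, 2
gamma, alpha = 0.95, 0.1
weights = np.array([0.5, 0.5])        # illustrative scalarisation weights

# Q-table keyed on the augmented state rather than the environmental state alone.
Q = defaultdict(lambda: np.zeros((n_actions, n_objectives)))

def augmented_state(env_state, summed_rewards, bucket=1.0):
    """Key: (environmental state, discretised accumulated reward vector).
    Discretisation keeps the augmented state space tabular; the bucket size
    is an assumption for illustration only."""
    buckets = tuple(np.round(np.asarray(summed_rewards) / bucket).astype(int))
    return (env_state, buckets)

def update(env_state, summed_rewards, action, reward_vec, next_env_state):
    summed_rewards = np.asarray(summed_rewards, dtype=float)
    reward_vec = np.asarray(reward_vec, dtype=float)
    s = augmented_state(env_state, summed_rewards)
    s_next = augmented_state(next_env_state, summed_rewards + reward_vec)
    # Greedy successor action under the scalarised Q-vectors of the augmented state.
    a_next = int(np.argmax(Q[s_next] @ weights))
    Q[s][action] += alpha * (reward_vec + gamma * Q[s_next][a_next] - Q[s][action])
```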