“…However, even though no specialized methods are needed to address this setting, it is nonetheless the most commonly studied setting for MORL. Linear scalarization with uniform weights, i.e., all the elements of w are equal, forms the basis of the work of Karlsson (1997), Ferreira, Bianchi, and Ribeiro (2012), Aissani, Beldjilali, and Trentesaux (2008) and Shabani (2009) amongst others, while non-uniform weights have been used by authors such as Castelletti et al (2002), Guo et al (2009) and Perez et al (2009). The majority of this work uses TD methods, which work on-line, although Castelletti et al (2010) extend off-line Fitted Q-Iteration (Ernst, Geurts, & Wehenkel, 2005) to multiple objectives.…”
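The excerpt above turns on linear scalarization: a weight vector w converts the vector-valued reward into a scalar via a weighted sum. A minimal Python sketch, with illustrative weights and rewards that are not taken from any of the cited papers:

```python
import numpy as np

def linear_scalarization(reward_vec, w):
    """Project a vector-valued reward onto a scalar via the weighted sum w . r."""
    return float(np.dot(w, reward_vec))

# Illustrative vector reward over three objectives.
r = np.array([1.0, -0.5, 2.0])

# Uniform weights: all elements of w are equal (here they also sum to 1).
w_uniform = np.ones(3) / 3
print(linear_scalarization(r, w_uniform))      # 0.833...

# Non-uniform weights: objectives weighted by assumed preferences.
w_nonuniform = np.array([0.6, 0.1, 0.3])
print(linear_scalarization(r, w_nonuniform))   # 1.15
```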
Sequential decision-making problems with multiple objectives arise naturally
in practice and pose unique challenges for research in decision-theoretic
planning and learning, which has largely focused on single-objective settings.
This article surveys algorithms designed for sequential decision-making
problems with multiple objectives. Though there is a growing body of literature
on this subject, little of it makes explicit under what circumstances special
methods are needed to solve multi-objective problems. Therefore, we identify
three distinct scenarios in which converting such a problem to a
single-objective one is impossible, infeasible, or undesirable. Furthermore, we
propose a taxonomy that classifies multi-objective methods according to the
applicable scenario, the nature of the scalarization function (which projects
multi-objective values to scalar ones), and the type of policies considered. We
show how these factors determine the nature of an optimal solution, which can
be a single policy, a convex hull, or a Pareto front. Using this taxonomy, we
survey the literature on multi-objective methods for planning and learning.
Finally, we discuss key applications of such methods and outline opportunities
for future work.
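The survey's solution concepts can be made concrete with a Pareto-dominance check: under maximization, one value vector dominates another if it is at least as good on every objective and strictly better on at least one, and the Pareto front keeps only the non-dominated vectors. A minimal sketch with illustrative candidate values:

```python
import numpy as np

def dominates(u, v):
    """True if u Pareto-dominates v (maximization): at least as good on every
    objective and strictly better on at least one."""
    return bool(np.all(u >= v) and np.any(u > v))

def pareto_front(values):
    """Filter a list of value vectors down to the non-dominated ones."""
    return [v for v in values
            if not any(dominates(u, v) for u in values if u is not v)]

# Illustrative two-objective values for four candidate policies.
values = [np.array([1.0, 3.0]), np.array([2.0, 2.0]),
          np.array([1.0, 2.5]),   # dominated by [1.0, 3.0]
          np.array([3.0, 1.0])]
print(pareto_front(values))       # keeps [1., 3.], [2., 2.] and [3., 1.]
```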
“…Otherwise an action is selected randomly. This has been the predominant exploration approach adopted in the MORL literature so far [12,15,16,19,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45].…”
Section: Exploration In Multiobjective RL
Despite growing interest over recent years in applying reinforcement learning to multiobjective problems, there has been little research into the applicability and effectiveness of exploration strategies within the multiobjective context. This work considers several widely-used approaches to exploration from the single-objective reinforcement learning literature, and examines their incorporation into multiobjective Q-learning. In particular, this paper proposes two novel approaches which extend the softmax operator to work with vector-valued rewards. The performance of these exploration strategies is evaluated across a set of benchmark environments. Issues arising from the multiobjective formulation of these benchmarks which impact on the performance of the exploration strategies are identified. It is shown that of the techniques considered, the combination of the novel softmax-epsilon exploration with optimistic initialisation provides the most effective trade-off between exploration and exploitation.
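The abstract above does not spell out the softmax-epsilon operator, so the sketch below is only one plausible reading: scalarise the vector-valued Q-estimates with a weight vector, draw from a Boltzmann (softmax) distribution over the resulting scalars, and fall back to a uniformly random action with probability epsilon. The combination, the weights, and the hyperparameters are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax_epsilon_action(q_vectors, w, epsilon=0.1, temperature=1.0, rng=None):
    """Pick an action from vector-valued Q-estimates.

    q_vectors : array of shape (n_actions, n_objectives)
    w         : scalarization weights, shape (n_objectives,)

    With probability epsilon the action is uniformly random; otherwise it is
    sampled from a softmax over the linearly scalarised values. This exact
    combination is an illustrative assumption, not the operator from the paper.
    """
    rng = rng or np.random.default_rng()
    n_actions = q_vectors.shape[0]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))        # random exploratory action
    scalar_q = q_vectors @ w                       # project vectors to scalars
    logits = (scalar_q - scalar_q.max()) / temperature
    probs = np.exp(logits) / np.exp(logits).sum()  # numerically stable softmax
    return int(rng.choice(n_actions, p=probs))

# Illustrative Q-estimates: 3 actions, 2 objectives.
Q = np.array([[1.0, 0.2], [0.5, 0.8], [0.1, 1.0]])
print(softmax_epsilon_action(Q, w=np.array([0.5, 0.5])))
```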
“…An advantage of this approach is that the base policies can be found using simple methods. For example, linearly scalarised temporal difference learning has been widely used to find LDS policies for MORL tasks [20,21,22]. Linear scalarisation takes a weighted sum of the rewards, converting the problem to a single-objective MDP so standard TD-based methods can be used [5].…”
Section: Learning Stochastic or Non-stationary Multiobjective Policies
For reinforcement learning tasks with multiple objectives, it may be advantageous to learn stochastic or non-stationary policies. This paper investigates two novel algorithms for learning non-stationary policies which produce Pareto-optimal behaviour (w-steering and Q-steering), by extending prior work based on the concept of geometric steering. Empirical results demonstrate that both new algorithms offer substantial performance improvements over stationary deterministic policies, while Q-steering significantly outperforms w-steering when the agent has no information about recurrent states within the environment. It is further demonstrated that Q-steering can be used interactively by providing a human decision-maker with a visualisation of the Pareto front and allowing them to adjust the agent's target point during learning. To demonstrate broader applicability, the use of Q-steering in combination with function approximation is also illustrated on a task involving control of local battery storage for a residential solar power system.
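The excerpt at the start of this entry notes that linear scalarisation reduces the task to a single-objective MDP, so standard TD methods can learn the LDS base policies. Below is a minimal sketch of one scalarised tabular Q-learning step; the state/action encoding, weights, and hyperparameters are illustrative assumptions rather than any cited paper's setup.

```python
import numpy as np

def scalarised_q_update(Q, s, a, reward_vec, s_next, w, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step on the linearly scalarised MDP.

    Q          : dict mapping each state to an array of scalar action values
    reward_vec : vector-valued reward observed from the environment
    w          : fixed scalarization weights (an assumption of this sketch)
    """
    r = float(np.dot(w, reward_vec))            # weighted sum of the rewards
    td_target = r + gamma * np.max(Q[s_next])   # ordinary single-objective target
    Q[s][a] += alpha * (td_target - Q[s][a])    # TD update toward the target
    return Q

# Illustrative usage: 2 states, 2 actions, 2 objectives.
Q = {0: np.zeros(2), 1: np.zeros(2)}
Q = scalarised_q_update(Q, s=0, a=1, reward_vec=np.array([1.0, -0.2]),
                        s_next=1, w=np.array([0.7, 0.3]))
print(Q[0])   # [0.    0.064]
```

Acting greedily with respect to such a table yields a deterministic stationary policy, i.e. one of the LDS base policies the excerpt refers to.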