Abstract: This paper describes a novel multi-objective reinforcement learning algorithm. The proposed algorithm first learns a model of the multi-objective sequential decision making problem, after which this learned model is used by a multi-objective dynamic programming method to compute Pareto optimal policies. The advantage of this model-based multi-objective reinforcement learning method is that once an accurate model has been estimated from the experiences of an agent in some environment, the dynamic programming…
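To make the two-stage idea concrete, here is a minimal, hypothetical Python sketch of model-based multi-objective RL: a maximum-likelihood model is estimated from logged transitions, and a set-based dynamic programming sweep then keeps a set of Pareto non-dominated value vectors per state. The function names, the deterministic-transition simplification and the pruning rule are illustrative assumptions, not the paper's specific algorithm.

```python
import numpy as np
from collections import defaultdict

def pareto_prune(vectors):
    """Keep only value vectors not dominated by any other vector (maximisation)."""
    kept = [v for v in vectors
            if not any(np.all(w >= v) and np.any(w > v) for w in vectors)]
    # drop exact duplicates
    return list({tuple(v): np.asarray(v) for v in kept}.values())

def estimate_model(transitions):
    """Tabular maximum-likelihood model from (s, a, reward_vector, s') tuples.
    For brevity the most frequent successor is kept, i.e. a deterministic model."""
    counts = defaultdict(lambda: defaultdict(int))
    rewards = defaultdict(list)
    for s, a, r, s2 in transitions:
        counts[(s, a)][s2] += 1
        rewards[(s, a)].append(np.asarray(r, dtype=float))
    T = {sa: max(succ, key=succ.get) for sa, succ in counts.items()}
    R = {sa: np.mean(rs, axis=0) for sa, rs in rewards.items()}
    return T, R

def pareto_value_iteration(T, R, states, actions, gamma=0.95, iters=50):
    """Set-based DP: each state keeps a set of non-dominated value vectors."""
    n_obj = len(next(iter(R.values())))
    V = {s: [np.zeros(n_obj)] for s in states}
    for _ in range(iters):
        V_new = {}
        for s in states:
            candidates = []
            for a in actions:
                if (s, a) not in T:
                    continue
                s2 = T[(s, a)]
                candidates += [R[(s, a)] + gamma * v for v in V[s2]]
            V_new[s] = pareto_prune(candidates) or V[s]
        V = V_new
    return V
```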
“…Otherwise an action is selected randomly. This has been the predominant exploration approach adopted in the MORL literature so far [12,15,16,19,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45].…”
Section: Exploration In Multiobjective RL (mentioning)
confidence: 99%
“…The DST has been widely adopted as a benchmark (e.g. [15,32,43,44]). The agent controls a submarine which starts from a location near the shore and travels out to sea to retrieve treasure.…”
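For readers unfamiliar with the benchmark, the following is a minimal, illustrative sketch of a Deep Sea Treasure-style gridworld with its two objectives (treasure value and time penalty). The 5x5 layout and treasure values used here are placeholders, not the canonical benchmark figures.

```python
import numpy as np

class DeepSeaTreasure:
    """Minimal Deep Sea Treasure-style episodic gridworld with two objectives:
    treasure value (maximise) and a time penalty of -1 per step."""

    # Illustrative sea floor: column -> (row of the treasure, treasure value).
    TREASURES = {0: (1, 1.0), 1: (2, 2.0), 2: (3, 3.0), 3: (4, 5.0), 4: (4, 8.0)}
    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

    def reset(self):
        self.pos = (0, 0)          # submarine starts near the shore, top-left corner
        return self.pos

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), 4)
        c = min(max(self.pos[1] + dc, 0), 4)
        self.pos = (r, c)
        row, value = self.TREASURES[c]
        done = (r == row)          # reached the treasure lying on this column's floor
        treasure = value if done else 0.0
        return self.pos, np.array([treasure, -1.0]), done
```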
Despite growing interest over recent years in applying reinforcement learning to multiobjective problems, there has been little research into the applicability and effectiveness of exploration strategies within the multiobjective context. This work considers several widely-used approaches to exploration from the single-objective reinforcement learning literature, and examines their incorporation into multiobjective Q-learning. In particular, this paper proposes two novel approaches which extend the softmax operator to work with vector-valued rewards. The performance of these exploration strategies is evaluated across a set of benchmark environments. Issues arising from the multiobjective formulation of these benchmarks which impact the performance of the exploration strategies are identified. It is shown that of the techniques considered, the combination of the novel softmax-epsilon exploration with optimistic initialisation provides the most effective trade-off between exploration and exploitation.
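As a rough illustration of how softmax-style exploration can be combined with epsilon-random moves and optimistic initialisation over vector-valued Q-estimates, consider the sketch below. The linear scalarisation, the temperature and epsilon values, and the initial optimistic value of 10.0 are assumptions for illustration; the paper's actual softmax-epsilon operator for vector rewards is defined in the paper itself.

```python
import numpy as np

def scalarise(q_vec, weights):
    """Linear scalarisation of a vector-valued Q estimate (one common choice)."""
    return float(np.dot(q_vec, weights))

def softmax_epsilon_action(Q, state, weights, tau=0.5, epsilon=0.1, rng=None):
    """Blend softmax preferences over scalarised Q-vectors with epsilon-random moves.
    Illustrative only; not the operator defined in the cited paper."""
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    scores = np.array([scalarise(Q[state, a], weights) for a in range(n_actions)])
    prefs = np.exp((scores - scores.max()) / tau)   # numerically stable softmax
    return int(rng.choice(n_actions, p=prefs / prefs.sum()))

# Optimistic initialisation: start Q above any achievable return so every
# state-action pair looks attractive until it has actually been tried.
n_states, n_actions, n_objectives = 100, 4, 2
Q = np.full((n_states, n_actions, n_objectives), 10.0)
```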
“…Our current implementation uses a very simple scalarization method to solve a multi-objective problem. There are many techniques designed to allow agents to more easily solve multi-objective problems [33], some of which might be used to enhance the performance of our controller [34]. Currently, our reward is a linear combination of a set of soft constraints, multiplied by the AND-operation of all hard constraints.…”
Section: Multi Objective Reinforcement Learning (mentioning)
confidence: 99%
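The reward construction described in the quoted passage, a weighted sum of soft-constraint scores gated by the logical AND of all hard constraints, can be sketched as follows. The constraint names, score ranges and weights are hypothetical.

```python
import numpy as np

def constrained_scalar_reward(soft_scores, soft_weights, hard_satisfied):
    """Scalar reward per the quoted recipe: a weighted sum of soft-constraint
    scores, multiplied by the AND of all hard constraints.

    soft_scores    : array of per-constraint scores, e.g. in [0, 1]
    soft_weights   : array of non-negative weights, one per soft constraint
    hard_satisfied : iterable of booleans, one per hard constraint
    """
    gate = float(all(hard_satisfied))              # AND over hard constraints
    return gate * float(np.dot(soft_weights, soft_scores))

# Illustrative usage with hypothetical constraints.
r = constrained_scalar_reward(
    soft_scores=np.array([0.8, 0.4]),              # e.g. comfort, energy efficiency
    soft_weights=np.array([0.7, 0.3]),
    hard_satisfied=[True, True],                   # e.g. safety limits respected
)
```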
“…Wiering et al. use a two-stage approach to learn the set of optimal policies [33] that are applicable in the Deep Sea Treasure problem. First, an agent explores the environment, attempting to explore and learn a model of the environment.…”
Section: Multi Objective Reinforcement Learning (mentioning)
Recent advances in the field of Neural Architecture Search (NAS) have made it possible to develop state-of-the-art deep learning systems without requiring extensive human expertise and hyperparameter tuning. In most previous research, little concern was given to the resources required to run the generated systems. In this paper, we present an improvement on a recent NAS method, Efficient Neural Architecture Search (ENAS). We adapt ENAS to not only take into account the network's performance, but also various constraints that would allow these networks to be ported to embedded devices. Our results show ENAS' ability to comply with these added constraints. In order to show the efficacy of our system, we demonstrate it by designing a Recurrent Neural Network that predicts words as they are spoken, and meets the constraints set out for operation on an embedded device, along with a Convolutional Neural Network, capable of classifying 32x32 RGB images at a rate of 1 FPS on an embedded device.
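The abstract does not give the reward formula, but one plausible way to fold resource constraints into an architecture-search reward is to penalise candidates that exceed the target device's budgets, as in the hedged sketch below. The budgets, penalty factor and multiplicative form are assumptions, not the method described in the paper.

```python
def constrained_nas_reward(accuracy, latency_ms, params_m,
                           max_latency_ms=50.0, max_params_m=5.0, penalty=0.5):
    """Illustrative reward for resource-aware architecture search: start from
    validation accuracy and penalise architectures that exceed the latency or
    parameter budgets of the target embedded device."""
    reward = accuracy
    if latency_ms > max_latency_ms:
        reward *= penalty * (max_latency_ms / latency_ms)
    if params_m > max_params_m:
        reward *= penalty * (max_params_m / params_m)
    return reward

# Example: an accurate but oversized candidate is ranked below a leaner one.
print(constrained_nas_reward(accuracy=0.92, latency_ms=120.0, params_m=8.0))
print(constrained_nas_reward(accuracy=0.88, latency_ms=35.0, params_m=3.2))
```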
“…However, the planning methods and learning methods are not entirely disjoint; when the agent explicitly learns a model of the environment through its interaction, it can use a planning method in order to produce a coverage set. Such model-based learning has been investigated extensively in single-objective settings, and has recently been introduced to multi-objective settings as well (Wiering et al., 2014). As such, the methods proposed in this dissertation can be employed as planning subroutines inside a model-based learning algorithm.…”
Decision making is hard. It often requires reasoning about uncertain environments, partial observability and action spaces that are too large to enumerate. In such complex decision-making tasks, decision-theoretic agents, which can reason about their environments on the basis of mathematical models and produce policies that optimize the utility for their users, can often assist us.

In most research on decision-theoretic agents, the desirability of actions and their effects is codified in a scalar reward function. However, many real-world decision problems have multiple objectives. In such cases the problem is more naturally expressed using a vector-valued reward function. Rather than having a single optimal policy, we then want to produce a set of policies that covers all possible preferences between the objectives. We call such a set a coverage set. In this dissertation, we focus on decision-theoretic planning algorithms that produce the convex coverage set (CCS), which is the optimal solution set when either: 1) the user utility can be expressed as a weighted sum over the values for each objective; or 2) policies can be stochastic.

We propose new methods based on two popular approaches to creating planning algorithms that produce an (approximate) CCS by building on an existing single-objective algorithm. In the inner loop approach, we replace the summations and maximizations in the innermost loops of the single-objective algorithm by cross-sums and pruning operations. In the outer loop approach, we solve a multi-objective problem as a series of scalarized problems by employing the single-objective method as a subroutine.

Our most important contribution is an outer loop framework that we call optimistic linear support (OLS). As an outer loop method, OLS builds the CCS incrementally. We show that, contrary to existing outer loop methods, each intermediate result is a bounded approximation of the CCS with known bounds (even when the single-objective method used is a bounded approximate method as well) and is guaranteed to terminate in a finite number of iterations.

We apply OLS-based algorithms to a variety of multi-objective decision problems, and show that OLS is more memory-efficient and faster than corresponding inner loop algorithms for moderate numbers of objectives. We show that exchanging subroutines in OLS is relatively easy and illustrate the importance on a complex planning problem. Finally, we show that it is often possible to reuse parts of the policies and values found in earlier iterations of OLS to hot-start later iterations. Using this last insight, we propose the first method for multi-objective POMDPs that employs point-based planning and can produce an ε-CCS in reasonable time.

Overall, the methods we propose bring us closer to truly practical multi-objective decision-theoretic planning.
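To illustrate the outer loop idea behind OLS, the following is a compact, two-objective sketch: repeatedly solve a scalarised problem at a candidate weight, keep the returned value vector if it improves on the current partial CCS, and queue the corner weights where the new vector ties with existing ones. Here solve_scalarised is a hypothetical single-objective solver supplied by the caller; real OLS additionally tracks improvement bounds and prioritises corner weights, which this sketch omits.

```python
import numpy as np

def ols_2d(solve_scalarised, tol=1e-6, max_iter=50):
    """Outer-loop sketch of optimistic linear support for two objectives.

    solve_scalarised(w) must return the value vector (length-2 array) of a
    policy that is optimal for the scalarised objective w . V."""
    ccs = []                                                 # value vectors found so far
    queue = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]     # extreme weights first
    for _ in range(max_iter):
        if not queue:
            break
        w = queue.pop()
        v = np.asarray(solve_scalarised(w), dtype=float)
        best = max((w @ u for u in ccs), default=-np.inf)
        if w @ v <= best + tol:                              # no improvement at this weight
            continue
        # corner weights where the new vector ties with an existing CCS vector
        for u in ccs:
            d = (v[0] - u[0]) - (v[1] - u[1])
            if abs(d) > tol:
                w1 = (u[1] - v[1]) / d    # solves w1*v0 + (1-w1)*v1 = w1*u0 + (1-w1)*u1
                if tol < w1 < 1 - tol:
                    queue.append(np.array([w1, 1.0 - w1]))
        ccs.append(v)
    return ccs
```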