Abstract-This paper describes a novel multi-objective reinforcement learning algorithm. The proposed algorithm first learns a model of the multi-objective sequential decision-making problem, after which this learned model is used by a multi-objective dynamic programming method to compute Pareto optimal policies. The advantage of this model-based multi-objective reinforcement learning method is that once an accurate model has been estimated from the experiences of an agent in some environment, the dynamic programming method will compute all Pareto optimal policies. It is therefore important that the agent explores the environment intelligently by using a good exploration strategy. In this paper we supply the agent with two different exploration strategies and compare their effectiveness in estimating accurate models within a reasonable amount of time. The experimental results show that our method with the best exploration strategy is able to quickly learn all Pareto optimal policies for the Deep Sea Treasure problem.

I. INTRODUCTION

Reinforcement learning (RL) [1], [2] enables an autonomous agent to learn from its interactions with a particular environment that emits reward signals to the agent. The objective of the agent is to learn a policy that obtains the highest possible discounted cumulative reward. In this paper we consider value-based reinforcement learning, where the agent estimates a value function denoting the future reward intake and uses this value function to select actions. Many value-based reinforcement learning algorithms have been proposed [2]. These algorithms can be divided into model-free and model-based reinforcement learning algorithms. Model-free methods such as Q-learning [3] update the Q-value function after each interaction with the environment without estimating a model. Model-based RL methods first learn to estimate a model of the environment and then use a dynamic programming algorithm to compute the policy. The advantage of model-based RL methods is that the experiences of the agent are used more effectively, leading to faster convergence to optimal policies.

Although reinforcement learning algorithms have traditionally been applied solely to single-objective decision problems, the amount of research on multi-objective problems has increased considerably during the last decade [4], [5]. In multi-objective reinforcement learning (MORL), the reward function emits a reward vector instead of a single scalar reward, and the goal is to learn all Pareto optimal policies.
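To make the two central ingredients concrete, we include two small illustrative sketches. The first shows model estimation from experience: a standard maximum-likelihood scheme that tallies observed transitions and reward vectors; the paper's exact estimator may differ, and the class and method names here are our own.

```python
from collections import defaultdict
import numpy as np

class MLModel:
    """Maximum-likelihood model of a multi-objective MDP, estimated from
    the agent's observed transitions (an illustrative sketch, not
    necessarily the estimator used in the paper)."""

    def __init__(self, n_objectives):
        # (s, a) -> {s': visit count}
        self.counts = defaultdict(lambda: defaultdict(int))
        # (s, a) -> running sum of observed reward vectors
        self.reward_sums = defaultdict(lambda: np.zeros(n_objectives))

    def observe(self, s, a, r_vec, s_next):
        """Record one experience tuple (s, a, r_vec, s')."""
        self.counts[(s, a)][s_next] += 1
        self.reward_sums[(s, a)] += np.asarray(r_vec, dtype=float)

    def transition_prob(self, s, a, s_next):
        """Empirical estimate of P(s' | s, a)."""
        total = sum(self.counts[(s, a)].values())
        return self.counts[(s, a)][s_next] / total if total else 0.0

    def mean_reward(self, s, a):
        """Empirical mean reward vector for (s, a)."""
        total = sum(self.counts[(s, a)].values())
        return self.reward_sums[(s, a)] / total if total else self.reward_sums[(s, a)]
```

The second sketch illustrates the Pareto optimality criterion that the dynamic programming stage relies on: a value vector is Pareto optimal if no other vector is at least as good in every objective and strictly better in at least one. The function names are again our own, and the example vectors are hypothetical (treasure value, negated time) pairs in the spirit of the Deep Sea Treasure problem.

```python
import numpy as np

def dominates(u, v):
    """Return True if value vector u Pareto-dominates v: u is at least as
    good in every objective and strictly better in at least one."""
    u, v = np.asarray(u), np.asarray(v)
    return bool(np.all(u >= v) and np.any(u > v))

def pareto_front(vectors):
    """Keep only the non-dominated (Pareto optimal) value vectors."""
    return [u for u in vectors
            if not any(dominates(v, u) for v in vectors if v is not u)]

# Hypothetical (treasure, -time) value vectors:
candidates = [(1.0, -1.0), (2.0, -3.0), (1.5, -3.0), (3.0, -5.0)]
print(pareto_front(candidates))  # (1.5, -3.0) is dominated by (2.0, -3.0)
```

In a model-based MORL method of the kind described above, estimates such as transition_prob and mean_reward would feed a multi-objective dynamic programming procedure, and a dominance test like dominates would prune dominated value vectors at each backup.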