2019
DOI: 10.1016/j.neunet.2019.01.011

Policy search in continuous action domains: An overview

Abstract: Continuous action policy search is currently the focus of intensive research, driven both by the recent success of deep reinforcement learning algorithms and the emergence of competitors based on evolutionary algorithms. In this paper, we present a broad survey of policy search methods, providing a unified perspective on very different approaches, including also Bayesian Optimization and directed exploration methods. The main message of this overview is in the relationship between the families of methods, but …

Cited by 61 publications (47 citation statements)
References 98 publications (142 reference statements)
“…Hierarchical reinforcement learning (HRL) is a promising approach to extend traditional reinforcement learning (RL) methods to solve tasks with long-term dependencies or multi-level interaction patterns [5,6]. Recent works suggest that several interesting and standout results can be obtained by training a multi-level hierarchical policy in a multi-task setup [8,25] or by applying a hierarchical setting to sparse-reward problems [23,34].…”
Section: Related Work (mentioning)
confidence: 99%
“…The state of this environment is continuous and defined by the position (x, y) of the particle, and the control actions are its velocities (ẋ, ẏ); thus D_S = D_A = 2. The initial position of the particle is sampled from a spherical Gaussian distribution centered at the position (4, 4). This task can be decomposed into two composable tasks, namely, reaching the position −2 in the x coordinate, and reaching the position −2 in the y coordinate.…”
Section: A. Tasks Description (mentioning)
confidence: 99%
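To make this kind of task concrete, below is a minimal Python sketch of a 2D point-mass environment of this shape. The class name, reward shaping, noise scale, time step, and horizon are illustrative assumptions and are not taken from the cited paper:

```python
import numpy as np

class PointMass2D:
    """Minimal 2D point-mass task: the state is the position (x, y) and the
    action is the velocity (dx, dy), so D_S = D_A = 2.
    Hypothetical sketch; goal, dt, horizon, and reward are illustrative."""

    def __init__(self, goal=(-2.0, -2.0), dt=0.1, horizon=100):
        self.goal = np.asarray(goal, dtype=np.float64)
        self.dt = dt
        self.horizon = horizon

    def reset(self, rng=None):
        rng = rng if rng is not None else np.random.default_rng()
        # Initial position drawn from a spherical Gaussian centered at (4, 4).
        self.state = rng.normal(loc=4.0, scale=1.0, size=2)
        self.t = 0
        return self.state.copy()

    def step(self, action):
        # The action is a velocity; integrate the position with a fixed time step.
        self.state = self.state + self.dt * np.asarray(action, dtype=np.float64)
        self.t += 1
        # Illustrative reward: negative Euclidean distance to the goal.
        reward = -np.linalg.norm(self.state - self.goal)
        done = self.t >= self.horizon
        return self.state.copy(), reward, done
```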
“…Several algorithms have been proposed to improve the sample efficiency of model-free deep RL by making better use of the sample information (data-efficiency), obtaining more information from the data (sample choice), and improving the policy several times with the same samples (sample reuse) [4].…”
Section: Introduction (mentioning)
confidence: 99%
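One common way to realize the "sample reuse" ingredient mentioned above is experience replay, where each stored transition can be drawn many times across policy or value updates. The sketch below is a generic illustration under that assumption, not code from the survey or from the citing paper:

```python
import random
from collections import deque

class ReplayBuffer:
    """Generic experience replay buffer (illustrative sketch)."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        # Store one transition; the oldest ones are discarded when full.
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # A transition may be sampled many times over the course of training,
        # which is the "sample reuse" idea referred to above.
        batch = random.sample(self.storage, min(batch_size, len(self.storage)))
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
```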
“…The way standard approaches to RL work is through a combination of hill-climbing (gradient descent) and random exploration. For example, state-of-the-art deep reinforcement learning algorithms for learning continuous control, such as DDPG and related algorithms [Lillicrap et al., 2015, Schulman et al., 2017, Sigaud and Stulp, 2018], work by alternating between updating the current controller solution in order to climb the hill of rewards (this requires that rewards of different magnitudes are observed when slightly changing the controller), and producing random perturbations of the current best controller to obtain further information about the …
[Figure 1 caption: Curiosity-driven exploration through autonomous goal setting and self-organized curriculum learning in the experimental setup presented in [Forestier et al., 2017] (see video: https://www.youtube.com/watch?v=NOLAwD4ZTW0).]…”
Section: Curiosity For Exploration and Discovery In An Open World (mentioning)
confidence: 99%
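The alternation described above (improving the current controller to climb the reward hill, and randomly perturbing it to gather more information) can be sketched in a simplified, random-search flavour as follows. The linear policy, noise scales, and greedy acceptance rule are assumptions made for illustration; this is not the DDPG algorithm itself. It can be run against any environment with the reset/step interface of the point-mass sketch given earlier:

```python
import numpy as np

def rollout(env, theta, rng, noise_std=0.1):
    """Return the episodic reward of the linear controller a = theta @ s,
    with Gaussian exploration noise added to each action (illustrative)."""
    state = env.reset(rng)
    total_reward, done = 0.0, False
    while not done:
        action = theta @ state + rng.normal(scale=noise_std, size=theta.shape[0])
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward

def hill_climb(env, dims=(2, 2), iterations=200, step=0.05, seed=0):
    """Alternate between random perturbations of the current best controller
    and keeping those that climb the hill of episodic reward."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(dims)
    best_return = rollout(env, theta, rng)
    for _ in range(iterations):
        candidate = theta + step * rng.normal(size=dims)   # random perturbation
        candidate_return = rollout(env, candidate, rng)
        if candidate_return > best_return:                 # greedy hill-climbing
            theta, best_return = candidate, candidate_return
    return theta, best_return
```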
“…before finding the first few action sequences that produce ball movement. This problem of RL approaches that focus on hill-climbing of the extrinsic reward is now well known, and applies to many environments with rare or deceptive rewards [Bellemare et al., 2016, Sigaud and Stulp, 2018, Colas et al., 2018].…”
Section: Curiosity For Exploration and Discovery In An Open World (mentioning)
confidence: 99%