2014 IEEE International Conference on Robotics and Automation (ICRA)
DOI: 10.1109/icra.2014.6907631

Combining learned controllers to achieve new goals based on linearly solvable MDPs

Abstract: Learning complicated behaviors usually involves intensive manual tuning and expensive computational optimization because we have to solve a nonlinear Hamilton-Jacobi-Bellman (HJB) equation. Recently, Todorov proposed the class of so-called linearly solvable Markov decision processes (LMDPs), which converts the nonlinear HJB equation into a linear differential equation. Linearity of the simplified HJB equation allows us to apply superposition to derive a new composite controller from a set of learned primitive controllers.
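The superposition the abstract refers to can be stated compactly. The following is a sketch in standard discrete-time LMDP notation following Todorov's framework, not quoted from this paper: with the desirability function $z(x) = e^{-v(x)}$, the Bellman equation becomes linear in $z$, and weighted blends of primitive solutions solve correspondingly blended tasks.

\[
z(x) = e^{-q(x)} \sum_{x'} p(x' \mid x)\, z(x'), \qquad z(x) \equiv e^{-v(x)}
\]
\[
e^{-g_{\mathrm{new}}(x)} = \sum_i w_i\, e^{-g_i(x)} \ \text{at terminal states} \ \Rightarrow\ z_{\mathrm{new}} = \sum_i w_i\, z_i .
\]

Here $q$ is the state cost, $p$ the passive dynamics, $v$ the value function, and $g_i$ the terminal costs of the learned primitives.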

Cited by 10 publications (4 citation statements) · References 24 publications
“…Note that the REINFORCE algorithm does not need Q_i. In addition, a deterministic stationary policy based on central pattern generators was prepared as prior knowledge, which was implemented by the modified Hopf oscillator (Uchibe and Doya, 2014). Since CRAIL uses multiple importance sampling, it is straightforward to use the deterministic policy as one of the sampling policies.…”
Section: Methods
confidence: 99%
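For context, the Hopf oscillator is the standard building block for such central-pattern-generator priors. The sketch below integrates the unmodified Hopf dynamics with Euler steps; the specific modification used by Uchibe and Doya (2014) and all parameter values here are illustrative assumptions, not taken from the excerpt.

```python
import numpy as np

def hopf_step(x, y, dt=0.01, mu=1.0, omega=2.0 * np.pi):
    """One Euler step of the standard Hopf oscillator.

    The limit cycle has radius sqrt(mu) and angular frequency omega.
    """
    r2 = x * x + y * y
    dx = (mu - r2) * x - omega * y
    dy = (mu - r2) * y + omega * x
    return x + dt * dx, y + dt * dy

# From an arbitrary start, the state converges to the unit limit cycle,
# giving a deterministic rhythmic signal usable as a CPG prior policy.
x, y = 0.1, 0.0
trajectory = []
for _ in range(2000):
    x, y = hopf_step(x, y)
    trajectory.append((x, y))
```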
“…RL problems requiring policies that solve several tasks at the same time are commonly stated as multiobjective or modular RL problems [12], [13]. The policies of all these subtasks may be combined using weights describing the predictability of the environmental dynamics [14], or using the values obtained from the desirability function in a linearly solvable control context [15]. Another alternative is to combine action-value functions of composable tasks and then extract a policy from this combined function [8].…”
Section: Related Work
confidence: 99%
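As an illustration of the desirability-based combination mentioned in [15], the sketch below blends primitive desirability vectors and recovers the LMDP-optimal controlled transition probabilities u*(x'|x) ∝ p(x'|x) z(x'). The function name, array layout, and toy numbers are assumptions for illustration, not the cited authors' API.

```python
import numpy as np

def composite_controller(P, z_list, w):
    """Blend primitive desirability vectors z_i with weights w_i and
    return the controlled transitions u*(x'|x) = p(x'|x) z(x') / G[z](x).

    P:      (n, n) passive dynamics, row-stochastic, P[x, x'] = p(x'|x)
    z_list: list of (n,) desirability vectors from learned primitives
    w:      iterable of nonnegative blend weights
    """
    z = sum(wi * zi for wi, zi in zip(w, z_list))  # composite desirability
    u = P * z[None, :]                             # reweight each successor state
    return u / u.sum(axis=1, keepdims=True)        # renormalize rows

# Toy usage: two primitives on a 3-state chain (numbers are made up).
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
z1, z2 = np.array([1.0, 0.5, 0.1]), np.array([0.1, 0.5, 1.0])
u = composite_controller(P, [z1, z2], [0.7, 0.3])
```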
“…Recently, there have been various discussions about using autoencoders to control robots (Noda et al., 2014; Finn et al., 2016; van Hoof et al., 2016; Kondo and Takahashi, 2017). Kullback-Leibler control (Todorov, 2009) is an interesting task-dependent approach to controlling robots by combining control policies (Uchibe and Doya, 2014; Matsubara et al., 2015).…”
Section: Introduction
confidence: 99%