2021
DOI: 10.48550/arxiv.2101.05982
Preprint

Randomized Ensembled Double Q-Learning: Learning Fast Without a Model

Abstract: Using a high Update-To-Data (UTD) ratio, model-based methods have recently achieved much higher sample efficiency than previous model-free methods for continuous-action DRL benchmarks. In this paper, we introduce a simple model-free algorithm, Randomized Ensembled Double Q-Learning (REDQ), and show that its performance is just as good as, if not better than, a state-of-the-art model-based algorithm for the MuJoCo benchmark. Moreover, REDQ can achieve this performance using fewer parameters than the model-based m…
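Since the abstract is truncated above, a minimal sketch of the paper's central mechanism may help: REDQ forms its TD target by taking a minimum over a small random subset of a larger critic ensemble, and repeats this update many times per environment step (the UTD ratio). The sketch below assumes a PyTorch, SAC-style setup; the names (redq_target, m, alpha) are illustrative, not the authors' reference code.

import random
import torch

def redq_target(target_critics, rewards, dones, next_obs, next_actions,
                next_log_probs, gamma=0.99, alpha=0.2, m=2):
    """TD target via in-target minimization over a random critic subset."""
    with torch.no_grad():
        # Sample M of the N target critics and take the elementwise min
        # of their predictions; this is REDQ's overestimation control.
        subset = random.sample(list(target_critics), m)
        q_values = torch.stack([q(next_obs, next_actions) for q in subset])
        min_q = q_values.min(dim=0).values
        # SAC-style entropy-regularized bootstrap target (assumed here,
        # since REDQ is built on top of SAC).
        return rewards + gamma * (1.0 - dones) * (min_q - alpha * next_log_probs)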

Cited by 18 publications (59 citation statements) | References 22 publications

“…Moreover, we implement the critic's model ensemble as a single neural network, using linear non-fully-connected layers evenly splitting the nodes and dropping the weight connections between the splits. Practically, when evaluated under the same hardware, this results in our algorithm running more than two times faster than the implementation from Chen et al. (2021) while having a similar algorithmic complexity.…”
Section: Methods (mentioning)
confidence: 93%
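The single-network trick described in this citation amounts to a block-diagonal linear layer: each ensemble member owns one block of the weights, and cross-block connections are dropped. Below is one hedged way such a layer could look in PyTorch; SplitLinear and n_members are assumed names, and this is a sketch of the idea rather than the citing authors' implementation.

import torch
import torch.nn as nn

class SplitLinear(nn.Module):
    """One layer holding n_members independent linear maps.

    Equivalent to a block-diagonal weight matrix: nodes are split evenly
    between members and no weights connect the splits, so the whole
    ensemble runs as a single batched matmul.
    """
    def __init__(self, n_members: int, in_features: int, out_features: int):
        super().__init__()
        # One (in_features, out_features) weight block per member, fan-in scaled.
        self.weight = nn.Parameter(
            torch.randn(n_members, in_features, out_features) * in_features ** -0.5
        )
        self.bias = nn.Parameter(torch.zeros(n_members, 1, out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_members, batch, in_features) -> (n_members, batch, out_features)
        return torch.baddbmm(self.bias, x, self.weight)

# Usage sketch: tile a shared (batch, in_features) input across members,
# then stack SplitLinear layers to run the whole critic ensemble as one net:
#   x = obs_act.unsqueeze(0).expand(n_members, -1, -1)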
“…Moreover, Lan et al. (2020) introduced a sampling procedure for the critic's ensemble predictions to regulate underestimation in the TD-targets. Their work was later extended to the continuous setting by Chen et al. (2021), who showed that large ensembles combined with a high update-to-data ratio make it possible to outperform the sample efficiency of contemporary model-based methods. Ensembling has also been used to achieve better exploration following the principle of optimism in the face of uncertainty (Brafman & Tennenholtz, 2002) in both discrete (Osband et al., 2016; Chen et al., 2017) and continuous settings (Ciosek et al., 2019).…”
Section: Related Work (mentioning)
confidence: 99%