Efficiently Combining Human Demonstrations and Interventions for Safe Training of Autonomous Systems in Real-Time

Goecks, Vinicius G.; Gremillion, Gregory M.; Lawhern, Vernon J.; Valasek, John; Waytowich, Nicholas R.

doi:10.1609/aaai.v33i01.33012462

Cited by 39 publications

(42 citation statements)

References 2 publications

“…Again due to faster-learning, our proposed method sub-goal+LbB uses the least amount of data. Our results also confirm results from (Goecks et al 2019) that Learning from Intervention (LfI) is data-efficient, as it uses only the intervention data rather than all data. As the demonstrations continue, it becomes more likely to encounter seen states and the states where the algorithm already performs well.…”

Section: Experiments and Results (supporting)

confidence: 86%

Learning from Interventions Using Hierarchical Policies for Safe Learning

Dhiman, Xiao, et al. 2020. AAAI

Learning from Demonstrations (LfD) via Behavior Cloning (BC) works well on multiple complex tasks. However, a limitation of the typical LfD approach is that it requires expert demonstrations for all scenarios, including those in which the algorithm is already well-trained. The recently proposed Learning from Interventions (LfI) overcomes this limitation by using an expert overseer. The expert overseer only intervenes when it suspects that an unsafe action is about to be taken. Although LfI significantly improves over LfD, the state-of-the-art LfI fails to account for delay caused by the expert's reaction time and only learns short-term behavior. We address these limitations by 1) interpolating the expert's interventions back in time, and 2) by splitting the policy into two hierarchical levels, one that generates sub-goals for the future and another that generates actions to reach those desired sub-goals. This sub-goal prediction forces the algorithm to learn long-term behavior while also being robust to the expert's reaction time. Our experiments show that LfI using sub-goals in a hierarchical policy framework trains faster and achieves better asymptotic performance than typical LfD.
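The abstract's first idea, interpolating the expert's interventions back in time, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `relabel_with_reaction_delay`, the scalar action representation, the fixed `delay_steps` window, and the linear blending rule are all assumptions made for illustration.

```python
from typing import List, Optional


def relabel_with_reaction_delay(
    agent_actions: List[float],
    expert_actions: List[Optional[float]],
    delay_steps: int,
) -> List[Optional[float]]:
    """Return training labels with each intervention extended back in time.

    expert_actions[t] is None when the expert did not intervene at step t.
    For up to `delay_steps` steps before an intervention starts, the label is
    linearly blended from the agent's action toward the expert's first
    corrective action (an assumed rule, not taken from the paper).
    Steps that receive no label stay None and would be ignored in training.
    """
    labels: List[Optional[float]] = list(expert_actions)
    for t, action in enumerate(expert_actions):
        starts_here = action is not None and (t == 0 or expert_actions[t - 1] is None)
        if not starts_here:
            continue
        first = max(0, t - delay_steps)
        for k in range(first, t):
            if expert_actions[k] is not None:
                continue  # never overwrite a real expert label
            weight = (k - first + 1) / (t - first + 1)
            labels[k] = (1 - weight) * agent_actions[k] + weight * action
    return labels


if __name__ == "__main__":
    agent = [0.0, 0.0, 0.0, 0.0, 0.0]
    expert = [None, None, None, 1.0, 1.0]
    labels = relabel_with_reaction_delay(agent, expert, delay_steps=2)
    # Keep only labelled steps, mirroring the "intervention data only" idea.
    print([(t, label) for t, label in enumerate(labels) if label is not None])
```

Filtering out the `None` entries leaves only intervention-derived samples, which matches the data-efficiency point in the citation statement above; how far back to relabel would in practice depend on the measured reaction time of the expert.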

“…• CoL stands for Cycle-of-Learning (Goecks et al 2019), which uses only intervention data as additional demonstration data and ignores the non-intervention data.…”

Section: Discussion (mentioning)

confidence: 99%

Learning from Interventions Using Hierarchical Policies for Safe Learning

Dhiman, Xiao, et al. 2020. AAAI

“…The policy is allowed to roll-out and is trained with a combined loss from a mix of demonstration and agent data, stored in a separate first-in-first-out buffer. We validate our approach in three environments with continuous observation and action spaces: LunarLanderContinuous-v2 (Brockman et al 2016) (dense and sparse reward cases) and a custom quadrotor landing task (Goecks et al 2019) implemented using Microsoft AirSim (Shah et al 2017). The dense reward case of LunarLanderContinuous-v2 is the standard environment provided by the OpenAI Gym library (Brockman et al 2016): state space consists of an eight-dimensional continuous vector with inertial states of the lander, action space consists of a two-dimensional continuous vector controlling main and side thrusts, and reward is given at every step based on the relative motion of the lander with respect to the landing pad (bonus reward is given when the landing is completed successfully).…”

Section: Methods (mentioning)

confidence: 99%
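The citation statement above mentions training on a mix of demonstration and agent data, with a first-in-first-out buffer. One possible reading is sketched below, assuming the demonstrations are kept permanently while agent transitions live in a bounded FIFO buffer. The class name `MixedReplay`, the `demo_fraction` parameter, and sampling with replacement are hypothetical choices, not details from the cited paper.

```python
import random
from collections import deque


class MixedReplay:
    """Demonstrations are kept for good; agent data is a bounded FIFO buffer."""

    def __init__(self, demonstrations, agent_capacity, seed=0):
        # Assumes at least one demonstration transition is provided.
        self.demonstrations = list(demonstrations)
        # deque(maxlen=...) silently drops the oldest item when full (FIFO).
        self.agent = deque(maxlen=agent_capacity)
        self.rng = random.Random(seed)

    def add_agent_transition(self, transition):
        self.agent.append(transition)

    def sample(self, batch_size, demo_fraction=0.25):
        """Return (demo_batch, agent_batch), sampled with replacement."""
        # Before any roll-out data exists, the whole batch is demonstrations.
        n_demo = round(batch_size * demo_fraction) if self.agent else batch_size
        agent_pool = list(self.agent)
        demo_batch = [self.rng.choice(self.demonstrations) for _ in range(n_demo)]
        agent_batch = [self.rng.choice(agent_pool) for _ in range(batch_size - n_demo)]
        return demo_batch, agent_batch


if __name__ == "__main__":
    replay = MixedReplay([("demo", i) for i in range(3)], agent_capacity=2)
    for i in range(5):
        replay.add_agent_transition(("agent", i))
    print(list(replay.agent))  # only the two most recent agent transitions remain
    demo_batch, agent_batch = replay.sample(batch_size=8)
    print(len(demo_batch), len(agent_batch))
```

Returning the two batches separately makes it straightforward to apply a demonstration-specific loss term to one and a reinforcement learning loss to the other.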

Integrating Behavior Cloning and Reinforcement Learning for Improved Performance in Dense and Sparse Reward Environments

Goecks, Gremillion, Lawhern, et al. 2019. Preprint. Self-citation.

This paper investigates how to efficiently transition and update policies, trained initially with demonstrations, using off-policy actor-critic reinforcement learning. It is well-known that techniques based on Learning from Demonstrations, for example behavior cloning, can lead to proficient policies given limited data. However, it is currently unclear how to efficiently update that policy using reinforcement learning, as these approaches are inherently optimizing different objective functions. Previous works have used loss functions which combine behavioral cloning losses with reinforcement learning losses to enable this update. However, the components of these loss functions are often set anecdotally, and their individual contributions are not well understood. In this work we propose the Cycle-of-Learning (CoL) framework that uses an actor-critic architecture with a loss function that combines behavior cloning and 1-step Q-learning losses with an off-policy pre-training step from human demonstrations. This enables transition from behavior cloning to reinforcement learning without performance degradation and improves reinforcement learning in terms of overall performance and training time. Additionally, we carefully study the composition of these combined losses and their impact on overall policy learning. We show that our approach outperforms state-of-the-art techniques for combining behavior cloning and reinforcement learning for both dense and sparse reward scenarios. Our results also suggest that directly including the behavior cloning loss on demonstration data helps to ensure stable learning and ground future policy updates.
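To make the idea of a loss that combines behavior cloning with 1-step Q-learning concrete, here is a minimal numeric sketch. It assumes scalar actions, a mean-squared-error form for both terms, and hypothetical weights `bc_weight` and `q_weight`; the abstract does not give the exact formulation, and any further terms the paper may use are omitted.

```python
from typing import Sequence


def mean_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Mean squared difference between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def combined_loss(
    policy_actions: Sequence[float],  # actor output on demonstration states
    demo_actions: Sequence[float],    # what the human did in those states
    q_values: Sequence[float],        # critic estimate Q(s, a)
    rewards: Sequence[float],
    next_q_values: Sequence[float],   # target estimate Q'(s', pi'(s'))
    dones: Sequence[bool],
    gamma: float = 0.99,
    bc_weight: float = 1.0,           # hypothetical weighting
    q_weight: float = 1.0,            # hypothetical weighting
) -> float:
    """Weighted sum of a behavior cloning loss and a 1-step TD loss."""
    bc_loss = mean_squared(policy_actions, demo_actions)
    # 1-step target: r + gamma * Q'(s', a'), with no bootstrap at episode end.
    targets = [
        r + gamma * (0.0 if done else next_q)
        for r, next_q, done in zip(rewards, next_q_values, dones)
    ]
    q_loss = mean_squared(q_values, targets)
    return bc_weight * bc_loss + q_weight * q_loss


if __name__ == "__main__":
    loss = combined_loss(
        policy_actions=[0.5], demo_actions=[1.0],
        q_values=[1.0], rewards=[1.0], next_q_values=[2.0], dones=[False],
        gamma=0.5,
    )
    print(loss)  # 0.25 from the cloning term plus 1.0 from the TD term
```

In a real actor-critic setup these terms would be computed on tensors and differentiated with respect to the network parameters; the plain-Python version only shows how the two objectives are added together.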

“…Considering these two facts, in this work, we deployed an intervention-based DAgger algorithm so that the human pilot can always take over the control when the UAV has reached an unsafe region and provide recovery actions. Relevant work [2,14,21] have shown that the intervention-based approach can learn a policy more effectively and achieve better performance.…”

Section: Related Work (mentioning)

confidence: 99%

Vision-Based 2D Navigation of Unmanned Aerial Vehicles in Riverine Environments with Imitation Learning

Wei, Liang, Michelmore, et al. 2022. J Intell Robot Syst

There have been many researchers studying how to enable unmanned aerial vehicles (UAVs) to navigate in complex and natural environments autonomously. In this paper, we develop an imitation learning framework and use it to train navigation policies for the UAV flying inside complex and GPS-denied riverine environments. The UAV relies on a forward-pointing camera to perform reactive maneuvers and navigate itself in 2D space by adapting the heading. We compare the performance of a linear regression-based controller, an end-to-end neural network controller and a variational autoencoder (VAE)-based controller trained with data aggregation method in the simulation environments. The results show that the VAE-based controller outperforms the other two controllers in both training and testing environments and is able to navigate the UAV with a longer traveling distance and a lower intervention rate from the pilots.
