“…The Q-function, Q_i(s, a), of Task i estimates the expected discounted return of the policy after taking action a at state s (Watkins & Dayan, 1992). Although this is an estimate acquired during training, it is a critical component in many state-of-the-art RL algorithms (Haarnoja et al., 2018; Lillicrap et al., 2015) and has been used to filter for high-quality data in multi-task (Yu et al., 2021) and imitation learning settings (Nair et al., 2018; Sasaki & Yamashina, 2020), which suggests that the Q-function is effective for evaluating and comparing actions even while training is ongoing. Unlike in single-task RL, we use the Q-function as a switch that rates action proposals from other tasks' policies at the current task's state s. This simple and intuitive function is state- and task-dependent, gives the current best estimate of which behaviors are most helpful, and adapts quickly to changes in its own and other policies during online learning.…”
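The excerpt describes the Q-function acting as a switch over candidate actions proposed by all tasks' policies. The following is a minimal sketch of that selection step under stated assumptions; the names `policies`, `q_i`, and `select_action` are illustrative, not the authors' implementation.

```python
# Sketch: use Task i's Q-function to rate action proposals from every task's
# policy at the current state, and act with the highest-rated proposal.
# All names here are hypothetical placeholders for the mechanism described above.
import numpy as np

def select_action(state, policies, q_i):
    """Pick the proposal that Task i's Q-function rates highest.

    policies: list of callables, policies[j](state) -> action proposed by Task j's policy
    q_i:      callable, q_i(state, action) -> Task i's estimated discounted return
    """
    proposals = [pi(state) for pi in policies]      # one candidate action per task
    scores = [q_i(state, a) for a in proposals]     # rate each candidate with Task i's Q
    return proposals[int(np.argmax(scores))]        # behave with the best-rated action
```

Because the switch only reads Q_i(s, a) at the current state, it requires no extra learned components and automatically tracks the latest critic estimates as all policies continue to update online.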