Self-Supervised Online Reward Shaping in Sparse-Reward Environments

Memarian, F.; Goo, Wonjoon; Lioutikov, Rudolf; Niekum, Scott; Topcu, Ufuk

doi:10.48550/arxiv.2103.04529

Cited by 4 publications

(7 citation statements)

References 16 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Number of PPO epochs: 10. Number of projected gradient ascent steps to compute δ s and δ s,a through ( 9), (12) in the main text: 10 steps. PPO clipping parameter: 0.2.…”

Section: Methodsmentioning

confidence: 99%

“…We perform two sets of experiments, one set uses the L 2 norm and the other uses L ∞ norm throughout the experiments. The norms are used for the following: 1) Defining the balls in which we find the adversarial perturbations δ s and δ s,a through ( 9), (12) in the main text. 2) Defining the ball from which we sample the noise injected at test time.…”

Section: Methodsmentioning

confidence: 99%

“…Note that δ s is the perturbation within the L 2 ball of radius r p which causes the largest divergence in the policy. Since it is computationally infeasible to find the exact solution to (12) for every sate, we instead use projected gradient ascent steps to get close to the solution. GAIL uses TRPO [15] steps to update the generator.…”

Section: Inducing Local Lipschitzness In the Generatormentioning

confidence: 99%

“…For a policy π, the discounted causal entropy is defined as H(π) := E (s,a)∈ρπ [− log(π(a|s)/(1 − γ)] in which ρ π is the state-action distribution induced by policy π. Generative Adversarial Imitation Learning. Imitation learning algorithms [6,7,8,9,10,11,12] aim to learn a policy that mimics the underlying behavior of the demonstrations. Methods such as inverse reinforcement learning (IRL) [7,8,9] do so by learning a reward function as an intermediate step.…”

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Robust Generative Adversarial Imitation Learning via Local Lipschitzness

Memarian,

Hashemi,

Niekum

et al. 2021

Preprint

Self Cite

View full text Add to dashboard Cite

We explore methodologies to improve the robustness of generative adversarial imitation learning (GAIL) algorithms to observation noise. Towards this objective, we study the effect of local Lipschitzness of the discriminator and the generator on the robustness of policies learned by GAIL. In many robotics applications, the learned policies by GAIL typically suffer from a degraded performance at test time since the observations from the environment might be corrupted by noise. Hence, robustifying the learned policies against the observation noise is of critical importance. To this end, we propose a regularization method to induce local Lipschitzness in the generator and the discriminator of adversarial imitation learning methods. We show that the modified objective leads to learning significantly more robust policies. Moreover, we demonstrate -both theoretically and experimentally -that training a locally Lipschitz discriminator leads to a locally Lipschitz generator, thereby improving the robustness of the resultant policy. We perform extensive experiments on simulated robot locomotion environments from the MuJoCo suite that demonstrate the proposed method learns policies that significantly outperform the state-of-the-art generative adversarial imitation learning algorithm when applied to test scenarios with noise-corrupted observations.

show abstract

“…Number of PPO epochs: 10. Number of projected gradient ascent steps to compute δ s and δ s,a through ( 9), (12) in the main text: 10 steps. PPO clipping parameter: 0.2.…”

Section: Methodsmentioning

confidence: 99%

Section: Methodsmentioning

confidence: 99%

Section: Inducing Local Lipschitzness In the Generatormentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Robust Generative Adversarial Imitation Learning via Local Lipschitzness

Memarian,

Hashemi,

Niekum

et al. 2021

Preprint

Self Cite

View full text Add to dashboard Cite

show abstract

“…However, such approach can easily exploit badly designed rewards, and get stuck in local optima and induce behavior that the designer did not intend. In contrast, goal-based sparse rewards are appealing since they do not suffer from the reward exploration problem [32]. In addition, this simple small set of rules have its similarities with biological behaviours, and therefore, applicable to animals with very limited level of information processing [33].…”

Section: G Reward Functionmentioning

confidence: 99%

A reinforcement learning path planning approach for range-only underwater target localization with autonomous vehicles

Masmitja¹,

Martín

Katija³

et al. 2022

2022 IEEE 18th International Conference on Automation Science and Engineering (CASE)

View full text Add to dashboard Cite

Underwater target localization using range-only and single-beacon (ROSB) techniques with autonomous vehicles has been used recently to improve the limitations of more complex methods, such as long baseline and ultra-short baseline systems. Nonetheless, in ROSB target localization methods, the trajectory of the tracking vehicle near the localized target plays an important role in obtaining the best accuracy of the predicted target position. Here, we investigate a Reinforcement Learning (RL) approach to find the optimal path that an autonomous vehicle should follow in order to increase and optimize the overall accuracy of the predicted target localization, while reducing time and power consumption. To accomplish this objective, different experimental tests have been designed using state-of-the-art deep RL algorithms. Our study also compares the results obtained with the analytical Fisher information matrix approach used in previous studies. The results revealed that the policy learned by the RL agent outperforms trajectories based on these analytical solutions, e.g. the median predicted error at the beginning of the target's localisation is 17% less. These findings suggest that using deep RL for localizing acoustic targets can be successfully applied to in-water applications that include tracking of acoustically tagged marine animals by autonomous underwater vehicles. This is envisioned as a first necessary step to validate the use of RL to tackle such problems, which could be used later on in a more complex scenarios I. INTRODUCTIONOne of the main challenges in marine research lies in underwater positioning of underwater features or assets (e.g., marine species [1] or underwater vehicles [2]). Due to the large attenuation of radio waves in water [3], Global Positioning System (GPS) signals are not suitable for positioning underwater targets. Nonetheless, acoustic signals can fill the underwater communications gap left by radio waves. Acoustic signals have much greater underwater propagation capabilities [4], and therefore, a network of nodes or beacons can be deployed and used to localize underwater targets, which may include Autonomous Underwater Vehicles (AUV), benthic rovers, or acoustically tagged organisms.Unfortunately, underwater acoustic deployments are often complex and highly economically and logistically expensive

show abstract

Path Planning of Unmanned Helicopter in Complex Dynamic Environment Based on State-Coded Deep Q-Network

Yao

Zhang

et al. 2022

Symmetry

View full text Add to dashboard Cite

Unmanned helicopters (UH) can avoid radar detection by flying at ultra-low altitudes; thus, they have been widely used in the battlefield. The flight safety of UH is seriously affected by moving obstacles such as flocks of birds in low airspace. Therefore, an algorithm that can plan a safe path to UH is urgently needed. Due to the strong randomness of the movement of bird flocks, the existing path planning algorithms are incompetent for this task. To solve this problem, a state-coded deep Q-network (SC-DQN) algorithm with symmetric properties is proposed, which can effectively avoid randomly moving obstacles and plan a safe path for UH. First, a dynamic reward function is designed to give UH appropriate rewards in real time, so as to improve the sparse reward problem. Then, a state-coding scheme is proposed, which uses binary Boolean expression to encode the environment state to compress environment state space. The encoded state is used as the input to the deep learning network, which is an important improvement to the traditional algorithm. Experimental results show that the SC-DQN algorithm can help UH avoid the moving obstacles to unknown motion status in the environment safely and effectively and successfully complete the raid task.

show abstract

Self-Supervised Online Reward Shaping in Sparse-Reward Environments

Cited by 4 publications

References 16 publications

Robust Generative Adversarial Imitation Learning via Local Lipschitzness

Robust Generative Adversarial Imitation Learning via Local Lipschitzness

A reinforcement learning path planning approach for range-only underwater target localization with autonomous vehicles

Path Planning of Unmanned Helicopter in Complex Dynamic Environment Based on State-Coded Deep Q-Network

Contact Info

Product

Resources

About