2021
DOI: 10.48550/arxiv.2103.04529
Preprint

Self-Supervised Online Reward Shaping in Sparse-Reward Environments

Abstract: We propose a novel reinforcement learning framework that performs self-supervised online reward shaping, yielding faster, more sample-efficient learning in sparse-reward environments. The proposed framework alternates between updating a policy and inferring a reward function. While the policy update is done with the inferred, potentially dense reward function, the original sparse reward provides a self-supervisory signal for the reward update by inducing an ordering over the observed trajectories. Th…
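The abstract describes an alternation between policy improvement on an inferred dense reward and a reward-inference step supervised only by the ordering that the sparse return induces over trajectories. The sketch below illustrates one plausible form of the reward-inference step, assuming a pairwise (Bradley-Terry) ranking loss and a replay buffer that yields trajectory pairs together with their sparse returns; RewardNet, rank_loss, reward_update, and buffer.sample_pairs are illustrative names and assumptions, not the authors' implementation.

# Minimal sketch of the alternation described in the abstract (not the paper's
# reference code). The trajectory/buffer format is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardNet(nn.Module):
    """Dense reward r_phi(s) inferred online from sparse-return rankings."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, states):                 # states: (T, obs_dim)
        return self.net(states).squeeze(-1)    # per-step dense reward: (T,)

def rank_loss(reward_net, traj_lo, traj_hi):
    """Bradley-Terry style loss: the trajectory with the higher sparse return
    (traj_hi) should also receive the higher learned return."""
    ret_lo = reward_net(traj_lo).sum()
    ret_hi = reward_net(traj_hi).sum()
    logits = torch.stack([ret_lo, ret_hi])
    target = torch.tensor(1)                   # index of the preferred trajectory
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

def reward_update(reward_net, optimizer, buffer):
    """One reward-inference step: sample trajectory pairs, order each pair by
    its original sparse return, and fit the dense reward to respect that order.
    `buffer.sample_pairs()` is assumed to yield ((states, sparse_return), ...)."""
    for (traj_a, ret_a), (traj_b, ret_b) in buffer.sample_pairs():
        lo, hi = (traj_a, traj_b) if ret_a < ret_b else (traj_b, traj_a)
        loss = rank_loss(reward_net, lo, hi)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The policy step (not shown) would then run an off-the-shelf RL update, e.g. PPO or SAC, on transitions relabeled with the learned dense reward instead of the sparse one.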

Cited by 4 publications (7 citation statements)
References 16 publications
“…Number of PPO epochs: 10. Number of projected gradient ascent steps to compute δ_s and δ_{s,a} through (9) and (12) in the main text: 10 steps. PPO clipping parameter: 0.2.…”
Section: Methods (mentioning)
confidence: 99%
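For context, the clipping parameter and epoch count quoted above plug directly into the standard PPO clipped surrogate objective. The snippet below is a generic illustration of that objective, not code from the cited work; ratio and advantages are assumed to be precomputed tensors.

import torch

CLIP_EPS = 0.2      # PPO clipping parameter quoted above
PPO_EPOCHS = 10     # number of PPO epochs quoted above

def clipped_surrogate(ratio, advantages):
    """Standard PPO clipped objective; ratio = pi_new(a|s) / pi_old(a|s)."""
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS) * advantages
    # PPO maximizes the minimum of the two terms (returned here as a loss).
    return -torch.min(unclipped, clipped).mean()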
“…We perform two sets of experiments: one set uses the L2 norm and the other uses the L∞ norm throughout the experiments. The norms are used for the following: 1) defining the balls in which we find the adversarial perturbations δ_s and δ_{s,a} through (9) and (12) in the main text; 2) defining the ball from which we sample the noise injected at test time.…”
Section: Methods (mentioning)
confidence: 99%
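The two statements above refer to computing adversarial perturbations by projected gradient ascent inside an L2 or L∞ ball. The sketch below shows a generic projection step and ascent loop for both norms, assuming a perturbation tensor delta, radius eps, and a differentiable loss_fn; it is not the cited paper's code.

import torch

def project(delta, eps, norm="l2"):
    """Project a perturbation back onto the norm ball of radius eps,
    as done after each gradient-ascent step in a PGD-style loop."""
    if norm == "linf":
        return delta.clamp(-eps, eps)
    # L2 case: rescale only if the perturbation leaves the ball.
    n = delta.flatten().norm(p=2)
    factor = min(1.0, eps / (n.item() + 1e-12))
    return delta * factor

def pgd_ascent(loss_fn, x, eps, steps=10, lr=0.01, norm="l2"):
    """Generic projected gradient ascent: maximize loss_fn(x + delta)
    subject to ||delta|| <= eps (10 steps, matching the count quoted above)."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(x + delta)
        loss.backward()
        with torch.no_grad():
            step = lr * delta.grad.sign() if norm == "linf" else lr * delta.grad
            delta += step
            delta.copy_(project(delta, eps, norm))
        delta.grad.zero_()
    return delta.detach()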
“…However, such an approach can easily exploit badly designed rewards, getting stuck in local optima and inducing behavior that the designer did not intend. In contrast, goal-based sparse rewards are appealing since they do not suffer from this reward-exploitation problem [32]. In addition, this simple, small set of rules has similarities with biological behaviours and is therefore applicable to animals with a very limited level of information processing [33].…”
Section: G. Reward Function (mentioning)
confidence: 99%
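As an illustration of the goal-based sparse reward contrasted above with hand-shaped dense rewards, the snippet below shows the usual formulation: a constant reward only when the goal condition is met and zero otherwise. The tolerance and reward values are illustrative assumptions, not taken from the cited work.

import numpy as np

def sparse_goal_reward(state, goal, tol=0.05):
    """Goal-based sparse reward: 1 when the agent is within `tol` of the goal,
    0 everywhere else. Nothing is shaped, so there is nothing to exploit,
    but the learning signal is rare."""
    dist = np.linalg.norm(np.asarray(state) - np.asarray(goal))
    return 1.0 if dist <= tol else 0.0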