Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence 2019
DOI: 10.24963/ijcai.2019/273
Unobserved Is Not Equal to Non-existent: Using Gaussian Processes to Infer Immediate Rewards Across Contexts

Abstract: Learning optimal policies in real-world domains with delayed rewards is a major challenge in Reinforcement Learning. We address the credit assignment problem by proposing a Gaussian Process (GP)-based immediate reward approximation algorithm and evaluate its effectiveness in 4 contexts where rewards can be delayed for long trajectories. In one GridWorld game and 8 Atari games, where immediate rewards are available, our results showed that on 7 out of 9 games, the proposed GP-inferred reward policy performed at le…
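For intuition only, the sketch below illustrates the general idea of GP-based immediate reward inference: fit a Gaussian Process over state features so that a delayed episode return is redistributed into smoothed per-step rewards. The uniform-credit targets, function name, and kernel choice are illustrative assumptions made here, not the algorithm from the paper.

```python
# Minimal sketch (not the paper's method): infer per-step rewards from a
# delayed episode return by fitting a Gaussian Process over state features.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def infer_immediate_rewards(trajectories):
    """trajectories: list of (states, delayed_return) pairs, where `states`
    is a (T, d) array of state features and `delayed_return` is the single
    reward observed at the end of the episode."""
    X, y = [], []
    for states, delayed_return in trajectories:
        # Naive credit assignment, used only to build regression targets:
        # spread the delayed return uniformly across the T steps.
        per_step_target = delayed_return / len(states)
        X.append(states)
        y.append(np.full(len(states), per_step_target))
    X = np.vstack(X)
    y = np.concatenate(y)

    # GP with an RBF kernel plus observation noise; the posterior mean
    # smooths the crude uniform targets across similar states, and the
    # posterior variance quantifies uncertainty in the inferred reward.
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gp.fit(X, y)
    return gp  # gp.predict(new_states, return_std=True) gives inferred rewards


# Example usage with toy data: two episodes of random 4-d state features.
rng = np.random.default_rng(0)
episodes = [(rng.normal(size=(10, 4)), 1.0), (rng.normal(size=(15, 4)), -0.5)]
model = infer_immediate_rewards(episodes)
rewards, stds = model.predict(rng.normal(size=(5, 4)), return_std=True)
```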

Cited by 12 publications (6 citation statements) | References 11 publications
“…Furthermore, our learned weightings over demonstrators could be seen as a form of attention [Zadeh et al, 2018]. The idea of learning shared features is inspired by both encoder sharing [Flet-Berliac and Preux, 2019] and uncertainty quantification [Azizsoltani et al, 2019;Brown et al, 2020]. Finally, our approach shares some of the similarities of Bayesian policy reuse [Rosman et al, 2016], by formulating the problem of policy selection as a Bayesian choice problem.…”
Section: Related Work
confidence: 99%
“…This is an alternative prior method for inferring the immediate rewards from the delayed ones. Prior work has shown that InferGP works reasonably well in a wide range of offline RL tasks [8]. Our goal is to determine the efficacy of InferNet for offline RL tasks, when compared to immediate, delayed and InferGP rewards.…”
Section: Offline RL Experiments
confidence: 99%
“…Recently, several DRL approaches have been investigated for septic treatment, utilizing Electronic Health Records (EHRs). However, [17,18] only considered delayed rewards, while [8] leveraged the Gaussian process based immediate reward inference method, which is one of our baselines. Data: Our EHRs were collected from a large US healthcare system (July, 2013 to December, 2015).…”
Section: RMSE
confidence: 99%
“…In this paper, we show how sub-optimal demonstrators with conflicting goals can be ranked according to their alignment with the target task goal in a safe Bayesian manner, and reused directly by the target agent without learning auxiliary representations such as policies or value functions. Our approach is related to encoder sharing [18,34,48,49] and uncertainty quantification of rewards or policies [5,12,25,46], but our problem setting and methodology are different.…”
Section: Related Work
confidence: 99%