2016
DOI: 10.1609/aaai.v30i1.10227

Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

Abstract: We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm, which encompasses the original ETD(λ), as well as several other off-policy evaluation algorithms as special cases. We call this framework ETD(λ, β), where our introduced parameter β controls the decay rate of an importance-sampling term. We study conditions under which the projected fixed-point equation unde…
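The β-controlled decay the abstract describes can be made concrete with a minimal sketch. The loop below follows the standard ETD(λ) recursion of Sutton, Mahmood, and White, with the discount γ in the followon trace replaced by the paper's β. The function name, single-episode one-pass structure, constant γ and λ, and unit interest i(s) ≡ 1 are illustrative assumptions, not the paper's exact presentation.

import numpy as np

def etd_lambda_beta(phi, rewards, rho, gamma, lam, beta, alpha):
    """Single-episode ETD(lambda, beta) sketch for linear off-policy
    evaluation, v(s) ~= theta . phi(s).

    phi     : (T+1, d) feature vectors for S_0 .. S_T
    rewards : (T,)  rewards R_1 .. R_T
    rho     : (T,)  importance ratios pi(A_t|S_t) / mu(A_t|S_t)
    beta    : followon decay; beta = gamma recovers ETD(lambda)
    """
    d = phi.shape[1]
    theta = np.zeros(d)
    e = np.zeros(d)   # eligibility trace
    F = 1.0           # followon trace, with unit interest i(s) = 1
    for t in range(len(rewards)):
        M = lam + (1.0 - lam) * F  # emphasis at time t
        delta = rewards[t] + gamma * (theta @ phi[t + 1]) - theta @ phi[t]
        e = rho[t] * (gamma * lam * e + M * phi[t])
        theta = theta + alpha * delta * e
        # beta, not gamma, discounts the accumulated importance
        # ratios -- the variance-control knob the abstract describes.
        F = beta * rho[t] * F + 1.0
    return theta

With beta = gamma the loop is exactly ETD(λ); pushing beta toward 0 shortens the importance-sampling memory, trading fixed-point bias for lower trace variance.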

Cited by 8 publications (9 citation statements)
References 9 publications
“…As previously described for goal-oriented RL, contextual RL aims to solve problems known as contextual MDPs (CMDPs) [16]. The first version of the CMDP, proposed in [17], defines the context as side information augmenting the MDP, in a fashion similar to contextual bandits. A CMDP can be defined as follows:…”
Section: Contextual RL
confidence: 99%
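The quoted definition is truncated at the ellipsis. For orientation only, the contextual MDP of [17] (Hallak, Di Castro, and Mannor, 2015) is commonly stated roughly as below; this is a sketch of the standard formulation from memory, not the citing paper's exact wording.

% Sketch: a CMDP is a context space indexing a family of MDPs that
% share states and actions but not dynamics or rewards.
\[
  \mathcal{M} = \bigl(\mathcal{C}, \mathcal{S}, \mathcal{A},
    \{(p^{c}, r^{c})\}_{c \in \mathcal{C}}\bigr),
  \qquad
  \mathcal{M}(c) = (\mathcal{S}, \mathcal{A}, p^{c}, r^{c}).
\]

Each context c ∈ C selects an MDP M(c) with its own transition kernel p^c and reward r^c, mirroring how side information indexes a contextual bandit.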
“…Motivation: The inspiration for the algorithm comes from the work of Hallak et al. [26], who demonstrate that by controlling β in the ETD(λ, β) algorithm, the variance of the algorithm can be reduced while still maintaining a reasonable bias, allowing for improved performance. The vanilla GEM algorithm for emphasis estimation and its variant GEM-ETD for OPE truly trade off bias and variance.…”
Section: Proposed Emphatic Algorithms
confidence: 99%
“…The ETD approach, originally proposed in the pioneering work of Sutton et al. [1], uses a scalar followon trace to surmount the distribution-mismatch issue in off-policy learning. However, Hallak et al. [26] demonstrated that the variance of the followon trace can be unbounded over a long or infinite time horizon. They further proposed the ETD(0, β) framework, which bounds the followon trace with a tunable hyperparameter β, a variable decay rate used for the bias-variance trade-off, at the cost of a possibly large bias error.…”
Section: Generalized Gradient Emphasis Learning With Function Approximation
confidence: 99%
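A short derivation makes the bounded-mean / unbounded-variance distinction behind both quotations concrete. This is a sketch under the conventions used above (unit interest, constant β), not a restatement of the paper's exact bounds.

\[
  F_t = 1 + \beta\,\rho_{t-1} F_{t-1}
      = \sum_{k=0}^{t} \beta^{k} \prod_{j=1}^{k} \rho_{t-j},
  \qquad
  \mathbb{E}_{\mu}[F_t] = \sum_{k=0}^{t} \beta^{k} \le \frac{1}{1-\beta}.
\]

The mean is bounded for any β < 1 because each importance ratio has unit conditional expectation under the behavior policy μ. The second moment, however, depends on E_μ[ρ²] ≥ 1: in an i.i.d.-ratio sketch the k-th term contributes on the order of (β² E_μ[ρ²])^k, which diverges whenever β² E_μ[ρ²] > 1. ETD(λ) corresponds to β = γ, so its followon trace can have unbounded variance even with γ < 1; choosing β < γ caps the variance but shifts the fixed point, which is the bias error both citing papers refer to.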