2010
DOI: 10.1007/978-3-642-16111-7_23

Adaptive ε-Greedy Exploration in Reinforcement Learning Based on Value Differences

Abstract: This paper presents "Value-Difference Based Exploration" (VDBE), a method for balancing the exploration/exploitation dilemma inherent to reinforcement learning. The proposed method adapts the exploration parameter of ε-greedy in dependence of the temporal-difference error observed from value-function backups, which is considered a measure of the agent's uncertainty about the environment. VDBE is evaluated on a multi-armed bandit task, which allows for insight into the behavior of the method. Preli…
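The abstract describes an ε that is adapted per state from the magnitude of the temporal-difference error. A minimal sketch of such a value-difference-based update is below; the exact softmax form, the `sigma` sensitivity parameter, and the `delta` mixing weight are assumptions for illustration, not a verbatim transcription of the paper's equations.

```python
import math

def vdbe_epsilon(eps, td_error, sigma=1.0, delta=0.5):
    """One value-difference-based update of a state's exploration rate.

    A bounded function of |TD error| pushes eps toward 1 while value
    estimates are still changing a lot (high uncertainty) and toward 0
    as backups stabilize. sigma (sensitivity) and delta (mixing weight)
    are illustrative choices, not values prescribed by the paper.
    """
    x = math.exp(-abs(td_error) / sigma)
    f = (1.0 - x) / (1.0 + x)           # in [0, 1): uncertainty measure
    return delta * f + (1.0 - delta) * eps
```

With a zero TD error the exploration rate decays toward 0; with a large TD error it is pulled up, so exploration concentrates where the agent is still uncertain.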

Cited by 211 publications (139 citation statements)
References 8 publications
“…Most often, the ε-greedy method is selected since it does not require memorizing any exploration-specific data, and in many applications it has achieved near-optimal results [40].…”
Section: Exploration-Exploitation Strategy
confidence: 99%
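The statement above highlights that ε-greedy needs no exploration-specific bookkeeping beyond the value estimates themselves. A minimal sketch (function and parameter names are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an action index from a list of value estimates.

    With probability epsilon, explore uniformly at random; otherwise
    exploit the current greedy action. No exploration-specific state
    is stored -- the Q-values are all that is needed.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Setting `epsilon=0` recovers pure exploitation; the adaptive schemes discussed in the cited paper replace the fixed `epsilon` with a state-dependent one.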
“…Exploration strategies conventionally use some form of the posterior variance [7] or the information-entropy [8] to quantify the information gained by a given action. While both strategies make intuitive sense, they are mathematically imprecise descriptors of the information gain which, due to a theorem of uniqueness in information theory [1], is defined as…”
Section: A. Information Gain and Exploration
confidence: 99%
“…We chose to model our situation as a multi-armed bandit problem [22]. There is a room full of casino machines (the bandits) with different reward distributions.…”
Section: Reinforcement Learning
confidence: 99%
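The casino-machine setting above can be sketched as a small simulation: each arm pays a noisy reward around its own mean, and the agent tracks an incremental sample mean per arm while choosing actions ε-greedily. The reward distribution (Gaussian), step count, and ε are assumptions for illustration.

```python
import random

def run_bandit(arm_means, steps=1000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent on a Gaussian multi-armed bandit.

    arm_means: true mean payout of each "casino machine" (unknown to
    the agent). Returns the estimated values and pull counts per arm.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    q = [0.0] * k      # estimated value per arm
    n = [0] * k        # number of pulls per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                      # explore
        else:
            arm = max(range(k), key=lambda a: q[a])     # exploit
        reward = rng.gauss(arm_means[arm], 1.0)         # noisy payout
        n[arm] += 1
        q[arm] += (reward - q[arm]) / n[arm]            # incremental mean
    return q, n
```

Because each arm is a one-step problem with an immediately observed reward, the bandit setting isolates the exploration/exploitation trade-off, which is why the VDBE paper uses it as its evaluation task.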