2011
DOI: 10.1007/978-3-642-24455-1_33

Value-Difference Based Exploration: Adaptive Control between Epsilon-Greedy and Softmax

Abstract: This paper proposes "Value-Difference Based Exploration combined with Softmax action selection" (VDBE-Softmax) as an adaptive exploration/exploitation policy for temporal-difference learning. The advantage of the proposed approach is that exploration actions are only selected in situations when the knowledge about the environment is uncertain, which is indicated by fluctuating values during learning. The method is evaluated in experiments having deterministic rewards and a mixture of both deterministic…
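
To make the adaptive-exploration idea concrete, here is a minimal Python sketch of the per-state exploration-rate update that this line of work (Tokic, 2010) builds on. All names are ours and the snippet is illustrative, not the paper's implementation:

```python
import math

def vdbe_update(eps_s, value_diff, n_actions, sigma=1.0):
    """Adapt the per-state exploration rate eps(s) from the magnitude
    of the last TD update of Q(s, a) (sketch after Tokic, 2010).

    value_diff : |Q_new(s, a) - Q_old(s, a)| from the last learning step.
    sigma      : sensitivity; smaller sigma reacts more strongly to
                 the same value difference.
    """
    # Boltzmann-like squashing of the value difference into [0, 1):
    # large fluctuations (uncertain knowledge) push f toward 1.
    x = math.exp(-abs(value_diff) / sigma)
    f = (1.0 - x) / (1.0 + x)
    delta = 1.0 / n_actions  # common choice for the mixing rate
    # Exponential moving average: eps(s) rises while values fluctuate
    # and decays toward 0 as learning converges.
    return delta * f + (1.0 - delta) * eps_s
```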

Cited by 176 publications (108 citation statements)
References 9 publications
“…One of the most challenging tasks in RL can be found in balancing between exploration and exploitation (Tokic & Palm, 2011). An often used approach to this tradeoff is the ε-greedy method (Watkins, 1989).…”
Section: Exploration Policy (mentioning)
confidence: 99%
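
For reference, a minimal ε-greedy sketch (illustrative Python; names are ours):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Explore uniformly with probability epsilon; otherwise exploit
    the currently highest-valued (greedy) action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit
```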
“…Tokic et al. propose the Value-Difference Based Exploration with Softmax action selection (VDBE-Softmax) policy (Tokic, 2010; Tokic & Palm, 2011). With VDBE-Softmax, the ε-greedy and the Softmax policies are combined in a way that exploration is performed with probability ε, using the Softmax probabilities defined in Equation (4).…”
Section: Exploration Policy (mentioning)
confidence: 99%
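
The combination described in this quote can be sketched as follows (illustrative Python; `softmax_probs` stands in for the paper's Equation (4), and all names are ours):

```python
import math
import random

def softmax_probs(q_values, tau):
    """Boltzmann/Softmax probabilities over Q-values at temperature tau
    (stands in for Equation (4) referenced in the quote)."""
    m = max(q_values)                                  # for numerical stability
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def vdbe_softmax_select(q_values, eps_s, tau=1.0):
    """VDBE-Softmax selection (sketch): with probability eps(s) explore
    by sampling from the Softmax distribution, else act greedily."""
    if random.random() < eps_s:
        weights = softmax_probs(q_values, tau)
        return random.choices(range(len(q_values)), weights=weights)[0]
    return max(range(len(q_values)), key=q_values.__getitem__)
```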
“…With this method, the probability ε of selecting an exploratory action is distributed uniformly among all the actions. – The second one is the Value-Difference Based Exploration with Softmax action selection policy (VDBE-Softmax) [9]: the client selects random actions using the Softmax probabilities when a uniformly drawn random number ξ < ε, and it chooses the greedy action otherwise. The Softmax probabilities are determined through the Boltzmann distribution proposed by Tokic [9], using a normalization of the Q-values into the interval [−1, 0] and a temperature parameter of T = 0.01.…”
Section: Influence of the Exploration Policy (mentioning)
confidence: 99%
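
As a sketch of the normalization step mentioned in this quote (the exact scaling is not given here; min-max scaling of the Q-values into [−1, 0] is an assumption):

```python
import math

def normalized_boltzmann(q_values, tau=0.01):
    """Boltzmann probabilities after scaling Q-values into [-1, 0],
    with the temperature T = 0.01 quoted above (sketch; min-max
    scaling is assumed, not taken from the source)."""
    q_min, q_max = min(q_values), max(q_values)
    span = (q_max - q_min) or 1.0          # guard against identical Q-values
    scaled = [(q - q_max) / span for q in q_values]  # best -> 0, worst -> -1
    exps = [math.exp(s / tau) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```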
“…A good example of a possible alternative approach would be the Softmax algorithm [19], which uses the Boltzmann distribution to define action-selection probabilities:…”
Section: Defining Actions (mentioning)
confidence: 99%
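
The Boltzmann action-selection probabilities referred to here have the standard textbook form (temperature τ; this is not a formula reproduced from [19]):

```latex
\pi(a) = \frac{e^{Q(a)/\tau}}{\sum_{b} e^{Q(b)/\tau}}
```

High temperatures make the distribution nearly uniform (more exploration); as τ → 0 it concentrates on the greedy action.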