Q-learning (Watkins & Dayan, 1992)
DOI: 10.1023/a:1022676722315

Abstract: Q-learning (Watkins, 1989) is a simple way for agents to learn how to act optimally in controlled Markovian domains. It amounts to an incremental method for dynamic programming which imposes limited computational demands. It works by successively improving its evaluations of the quality of particular actions at particular states. This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q-learning converges to the optimum acti…
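To make the incremental update described in the abstract concrete, here is a minimal tabular Q-learning sketch in Python. The environment interface (reset()/step()), the learning rate alpha, the exploration rate epsilon, and the default values are illustrative assumptions, not details taken from the paper.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch; env is a hypothetical interface with reset()/step()."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration over the current Q estimates
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # incremental update toward the one-step lookahead target
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q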

Citations: cited by 1,404 publications (191 citation statements).
References: 12 publications.
“…delayed Q-learning [13] would be a better option if speed were an issue). We use an off-the-shelf implementation of Q-learning, as explained in [18] and [14]. We use the description of cell contents as a state.…”
Section: An AI Agent: Q-learning (mentioning)
confidence: 99%
“…In this paper we use one of these tests, a prototype based on the anytime intelligence test presented in [5] and the environment class introduced in [4], to evaluate one easily accessible biological system (Homo sapiens) and one off-the-shelf AI system, a popular reinforcement algorithm known as Q-learning [18]. In order to do the comparison we use the same environment class for both types of systems and we design hopefully non-biased interfaces for both.…”
Section: Introduction (mentioning)
confidence: 99%
“…Both paradigms use the TD error to update the state value. Q-learning is based on the TD algorithm, and optimizes the long-term value of performing a particular action in a given state by generating and updating a state-action value function Q (Sutton and Barto 1998; Watkins and Dayan 1992). This model assigns a Q-value for each action-state pair (rather than simply for each state as in standard TD).…”
Section: Q-learning Algorithm and the Actor-Critic Model (mentioning)
confidence: 99%
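The distinction the excerpt draws between a state-value update and a state-action-value update can be written out explicitly. In standard notation (the learning rate \alpha and the symbols below are notational assumptions, not taken from the quoted text):

\[ V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right] \qquad \text{(TD state-value update)} \]
\[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \qquad \text{(Q-learning)} \]

Both use a TD error (the bracketed term), but Q-learning maintains a value for every state-action pair and bootstraps from the best action in the next state.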
“…Q-learning finds the Q-value by iteratively approximating the Q-function using the difference between the predicted value and the actual value as the estimation error [38]. γ ∈ [0, 1] is the discount factor and if γ is high, the system gives a higher weight to the Q-value of the new state by the action than the reward of the past action.…”
Section: Dynamic Sensing Parameter Control Using Q-learning (mentioning)
confidence: 99%
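A small numerical illustration of the discount factor's role (the numbers are hypothetical): with immediate reward r = 1 and best next-state value \max_{a'} Q(s', a') = 10, the one-step target is

\[ r + \gamma \max_{a'} Q(s', a') = 1 + 0.9 \cdot 10 = 10 \quad (\gamma = 0.9), \qquad 1 + 0.1 \cdot 10 = 2 \quad (\gamma = 0.1), \]

so a high \gamma lets the Q-value of the new state dominate the immediate reward, as the excerpt describes.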