Proceedings of the International Joint Conference on Neural Networks, 2003.
DOI: 10.1109/ijcnn.2003.1223699
Competitive reinforcement learning in continuous control tasks

Cited by 7 publications (7 citation statements) | References 11 publications

“…The optimal number of steps indicated in (Smart and Kaelbling, 2000) is 56 steps, so 60 steps per episode can be considered optimal given the continuing exploration caused by the -1 reward on every step. The converged learning result is therefore also as good as in (Abramson et al., 2003) and (Schaefer et al., 2007). (Abramson et al., 2003) used Learning Vector Quantization together with Sarsa instead of NRBF, so the complexity of the learning algorithm is similar to the one used in this paper.…”
Section: Mountain-car
confidence: 64%
“…The converged learning result is therefore also as good as in (Abramson et al., 2003) and (Schaefer et al., 2007). (Abramson et al., 2003) used Learning Vector Quantization together with Sarsa instead of NRBF, so the complexity of the learning algorithm is similar to the one used in this paper. The reward function provided more guidance towards the goal than the one used here, and a fixed starting point was used instead of a random one.…”
Section: Mountain-car
confidence: 64%
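The two excerpts above compare Sarsa with Learning Vector Quantization against Sarsa with normalized radial basis functions (NRBF) on the mountain-car task under a -1 reward on every step, where fewer steps per episode means a better policy. For orientation only, here is a minimal Python sketch of Sarsa(0) with normalized RBF features on the standard mountain-car dynamics; the grid resolution, step size, exploration rate, episode cap, and random start range are assumptions, not settings taken from the cited papers.

```python
# Hedged sketch: Sarsa(0) with normalized RBF features on the classic
# mountain-car task (-1 reward per step). Hyperparameters are illustrative
# assumptions, not the settings used by the cited papers.
import numpy as np

P_MIN, P_MAX, V_MIN, V_MAX = -1.2, 0.6, -0.07, 0.07

def step(pos, vel, action):              # action in {0, 1, 2}: push left / none / right
    vel += 0.001 * (action - 1) - 0.0025 * np.cos(3 * pos)
    vel = np.clip(vel, V_MIN, V_MAX)
    pos += vel
    if pos <= P_MIN:                     # inelastic left wall
        pos, vel = P_MIN, 0.0
    done = pos >= 0.5                    # goal at the top of the right hill
    return pos, vel, -1.0, done          # -1 reward on every step

centers = np.array([(p, v)
                    for p in np.linspace(P_MIN, P_MAX, 8)
                    for v in np.linspace(V_MIN, V_MAX, 8)])
widths = np.array([(P_MAX - P_MIN) / 8, (V_MAX - V_MIN) / 8])

def features(pos, vel):                  # normalized RBF activations
    d = (np.array([pos, vel]) - centers) / widths
    phi = np.exp(-0.5 * np.sum(d * d, axis=1))
    return phi / phi.sum()

w = np.zeros((3, len(centers)))          # linear Q(s, a) = w[a] . phi(s)
alpha, gamma, eps = 0.1, 1.0, 0.1

def choose(phi):                         # epsilon-greedy behaviour policy
    if np.random.rand() < eps:
        return np.random.randint(3)
    return int(np.argmax(w @ phi))

for episode in range(200):
    pos, vel = np.random.uniform(-0.6, -0.4), 0.0   # random start near the valley
    phi = features(pos, vel)
    a = choose(phi)
    for t in range(2000):
        pos, vel, r, done = step(pos, vel, a)
        phi2 = features(pos, vel)
        a2 = choose(phi2)
        target = r if done else r + gamma * w[a2] @ phi2
        w[a] += alpha * (target - w[a] @ phi) * phi  # on-policy Sarsa update
        phi, a = phi2, a2
        if done:
            break
```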
“…With an off-policy method, the evaluation of the move made in a state affects the policy from the current move onwards, but the move itself is independent of the policy update, so a different policy can be used to make the move. On-policy methods rely heavily on exploration for the action values to be accurate [23].…”
Section: On-policy vs Off-policy Methods
confidence: 99%
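To make the quoted distinction concrete, here is a generic illustration (not code from the cited work) contrasting the on-policy Sarsa target, which bootstraps on the action the behaviour policy actually takes next, with the off-policy Q-learning target, which bootstraps on the greedy action regardless of how the next move is chosen; the tabular Q, toy transition, and step sizes are illustrative assumptions.

```python
# Hedged illustration of the on-policy / off-policy distinction described in
# the excerpt, using the standard Sarsa and Q-learning targets.
import numpy as np

def sarsa_target(Q, r, s_next, a_next, gamma=0.99):
    # On-policy: bootstrap on the action the behaviour policy actually takes next.
    return r + gamma * Q[s_next, a_next]

def q_learning_target(Q, r, s_next, gamma=0.99):
    # Off-policy: bootstrap on the greedy action, independent of the move made.
    return r + gamma * np.max(Q[s_next])

def td_update(Q, s, a, target, alpha=0.1):
    Q[s, a] += alpha * (target - Q[s, a])

# Toy usage: 4 states, 2 actions, one transition (s=0, a=1, r=-1, s'=2, a'=0).
Q = np.zeros((4, 2))
td_update(Q, 0, 1, sarsa_target(Q, -1.0, 2, 0))     # evaluates the policy being followed
td_update(Q, 0, 1, q_learning_target(Q, -1.0, 2))   # can learn from any behaviour policy
```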
“…This function approximation approach separates learning the action value function from learning the state representation (but see [8] for a combined approach). The intermediate reward r_c is obtained from the user decision at the choice point, while the discounted terminal rewards upon reaching the goal states are obtained from the underlying temporal MDP.…”
Section: Reinforcement Learning of User Preferences
confidence: 99%
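As a rough illustration of the return structure described in the excerpt, the sketch below combines an immediate choice-point reward r_c with a terminal reward discounted over the remaining steps of the underlying temporal MDP; the helper name, the value of gamma, and the example numbers are hypothetical and not taken from the cited paper.

```python
# Hedged sketch of the return at a choice point: an immediate reward r_c plus
# a terminal reward discounted by the number of remaining steps in the
# underlying temporal MDP. Names and values are illustrative assumptions.
def choice_point_return(r_c, terminal_reward, steps_to_goal, gamma=0.95):
    return r_c + (gamma ** steps_to_goal) * terminal_reward

# Example: a choice earns r_c = 0.5 now and reaches a goal worth 1.0 after 3 steps.
print(choice_point_return(0.5, 1.0, 3))   # 0.5 + 0.95**3 * 1.0
```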