2012
DOI: 10.1007/978-3-642-27645-3_7

Reinforcement Learning in Continuous State and Action Spaces

Abstract: Many traditional reinforcement-learning algorithms have been designed for problems with small finite state and action spaces. Learning in such discrete problems can be difficult, due to noise and delayed reinforcements. However, many real-world problems have continuous state or action spaces, which can make learning a good decision policy even more involved. In this chapter we discuss how to automatically find good decision policies in continuous domains. Because analytically computing a good polic…

Cited by 152 publications (125 citation statements)
References: 125 publications
“…Many variants of traditional RL exist (e.g., Barto et al., 1983; Watkins, 1989; Watkins and Dayan, 1992; Moore and Atkeson, 1993; Schwartz, 1993; Rummery and Niranjan, 1994; Singh, 1994; Baird, 1995; Kaelbling et al., 1995; Peng and Williams, 1996; Mahadevan, 1996; Tsitsiklis and van Roy, 1996; Bradtke et al., 1996; Santamaría et al., 1997; Prokhorov and Wunsch, 1997; Sutton and Barto, 1998; Wiering and Schmidhuber, 1998b; Baird and Moore, 1999; Meuleau et al., 1999; Morimoto and Doya, 2000; Bertsekas, 2001; Brafman and Tennenholtz, 2002; Abounadi et al., 2002; Lagoudakis and Parr, 2003; Sutton et al., 2008; Maei and Sutton, 2010; van Hasselt, 2012). Most are formulated in a probabilistic framework, and evaluate pairs of input and output (action) events (instead of input events only).…”
Section: Deep FNNs for Traditional RL and Markov Decision Processes (mentioning)
confidence: 99%
“…The DQN learning target for the Q(s, a) function is defined through the maximum of the estimated action values, which can be biased in stochastic environments and hence result in overestimation. The double Q-network (van Hasselt et al., 2015) reduces this overestimation by combining double Q-learning with deep models, and can thus be used to approximate large-scale value functions. The deep deterministic policy gradient (DDPG) method (Lillicrap et al., 2015) improves the robustness of gradient estimation for deep continuous control models.…”
Section: Trends in the Development of AI Technology Applications for… (mentioning)
confidence: 99%
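The decoupling performed by the double Q-network described above is easiest to see in code. The following is a minimal sketch, assuming PyTorch and two illustrative Q-networks named online_net and target_net (these names and the gamma default are assumptions for illustration, not taken from the cited papers); it contrasts the plain DQN target with the Double DQN target.

```python
import torch

def dqn_target(reward, next_state, done, target_net, gamma=0.99):
    # Plain DQN target: the target network both selects and evaluates the
    # greedy next action, so noisy value estimates are maximized over.
    with torch.no_grad():
        max_q = target_net(next_state).max(dim=1).values
    return reward + gamma * (1.0 - done) * max_q

def double_dqn_target(reward, next_state, done, online_net, target_net, gamma=0.99):
    # Double DQN target: the online network selects the action, the target
    # network evaluates it, decoupling selection from evaluation.
    with torch.no_grad():
        best_action = online_net(next_state).argmax(dim=1, keepdim=True)
        q_eval = target_net(next_state).gather(1, best_action).squeeze(1)
    return reward + gamma * (1.0 - done) * q_eval
```

In the plain target, the same network both selects and evaluates the maximizing action, so estimation noise propagates through the max; in the double target, selection and evaluation use different parameter sets, which is what reduces the overestimation mentioned above.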
“…This increases the probability of overestimating the value of the state-action pairs (van Hasselt, 2010; van Hasselt et al., 2015). To see this more clearly, the target part of the loss in Equation 4 can be rewritten as follows:…”
Section: Double DQN: Overcoming Overestimation and Instability of DQN (mentioning)
confidence: 99%
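The rewritten target this excerpt refers to is cut off in the quote; as a hedged reconstruction of the standard derivation (the symbols θ for the online network and θ⁻ for the target network are assumptions about the citing paper's notation), the DQN target can be rewritten to expose where action selection and evaluation coincide, and Double DQN then decouples them:

\[
y^{\mathrm{DQN}} = r + \gamma \max_{a'} Q(s', a'; \theta^-)
= r + \gamma\, Q\!\left(s', \operatorname*{arg\,max}_{a'} Q(s', a'; \theta^-); \theta^-\right),
\]
\[
y^{\mathrm{DoubleDQN}} = r + \gamma\, Q\!\left(s', \operatorname*{arg\,max}_{a'} Q(s', a'; \theta); \theta^-\right).
\]

Because the arg max and the evaluation in the first expression share the same parameters, upward estimation errors are systematically selected, which is the overestimation the quote describes.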
“…We analyse four deep RL models: Deep Q-Networks (DQN) (Mnih et al., 2013), Double DQN (DDQN) (van Hasselt et al., 2015), Deep Advantage Actor-Critic (DA2C) (Sutton et al., 2000), and a version of DA2C initialized with supervised learning (TDA2C)¹ (a similar idea to Silver et al. (2016)). All models are trained on a restaurant-seeking domain.…”
Section: Introduction (mentioning)
confidence: 99%