Convergence of a Q-learning Variant for Continuous States and Actions

Carden, Stephen W.

doi:10.1613/jair.4271

Cited by 9 publications

(5 citation statements)

References 28 publications


“…Under standard assumptions, CAQL with dynamic tolerance $\{\tau_t\}$ converges a.s. to a stationary point (Thm. 1, (Carden, 2014)).…”

Section: Accelerating Max-q Computation (mentioning)

confidence: 99%

CAQL: Continuous Action Q-Learning

Ryu, Chow, Anderson et al. 2019

Preprint


Value-based reinforcement learning (RL) methods like Q-learning have shown success in a variety of domains. One challenge in applying Q-learning to continuous-action RL problems, however, is the continuous action maximization (max-Q) required for optimal Bellman backup. In this work, we develop CAQL, a (class of) algorithm(s) for continuous-action Q-learning that can use several plug-and-play optimizers for the max-Q problem. Leveraging recent optimization results for deep neural networks, we show that max-Q can be solved optimally using mixed-integer programming (MIP). When the Q-function representation has sufficient power, MIP-based optimization gives rise to better policies and is more robust than approximate methods (e.g., gradient ascent, cross-entropy search). We further develop several techniques to accelerate inference in CAQL, which despite their approximate nature, perform well. We compare CAQL with state-of-the-art RL algorithms on benchmark continuous-control problems that have different degrees of action constraints and show that CAQL outperforms policy-based methods in heavily constrained environments, often dramatically.


“…In other words, the controller structure should induce similar decisions from similar observation chains. This typical assumption is also made by the continuous state-action MDP and POMDP literature [7], [8], [19].…”

Section: A Stochastic Kernel-based Finite State Automata (mentioning)

confidence: 99%

Scalable accelerated decentralized multi-robot policy search in continuous observation spaces

Omidshafiei, Amato, Liu et al. 2017

2017 IEEE International Conference on Robotics and Automation (ICRA)


This paper presents the first ever approach for solving continuous-observation Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) and their semi-Markovian counterparts, Dec-POSMDPs. This contribution is especially important in robotics, where a vast number of sensors provide continuous observation data. A continuous-observation policy representation is introduced using Stochastic Kernel-based Finite State Automata (SK-FSAs). An SK-FSA search algorithm titled Entropy-based Policy Search using Continuous Kernel Observations (EPSCKO) is introduced and applied to the first ever continuous-observation Dec-POMDP/Dec-POSMDP domain, where it significantly outperforms state-of-the-art discrete approaches. This methodology is equally applicable to Dec-POMDPs and Dec-POSMDPs, though the empirical analysis presented focuses on Dec-POSMDPs due to their higher scalability. To improve convergence, an entropy injection policy search acceleration approach for both continuous and discrete observation cases is also developed and shown to improve convergence rates without degrading policy quality.


“…Traditionally, value function $V_0(v, \tau)$ is initialized as 0, which may slow down the convergence speed (Carden 2014). Therefore, we propose a warm start strategy that approximates the probability of arriving on time for vehicles at intersection $v$ with time-to-deadline $\tau$ as follows: $V_0(v, \tau) = 1/(1 + e^{-\zeta(\tau - T_e)})$, where $\zeta$ is the coefficient.…”

Section: Other Practical Considerations (mentioning)

confidence: 99%

Maximizing the Probability of Arriving on Time: A Practical Q-Learning Method

Cao, Guo, Zhang et al. 2017

AAAI


The stochastic shortest path problem is of crucial importance for the development of sustainable transportation systems. Existing methods based on the probability tail model seek the path that maximizes the probability of arriving at the destination before a deadline. However, they suffer from low accuracy and/or high computational cost. We design a novel Q-learning method in which the converged Q-values have a practical meaning as the actual probabilities of arriving on time, so as to improve accuracy. By further adopting dynamic neural networks to learn the value function, our method can scale well to large road networks with arbitrary deadlines. Experimental results on real road networks demonstrate the significant advantages of our method over its counterparts.

