We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and, as a solution, propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function $Q^*$ lies within a known hypothesis class $\mathcal{Q}$, OCP selects optimal actions over all but at most $\dim_E[\mathcal{Q}]$ episodes, where $\dim_E$ denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if $Q^*$ does not lie in $\mathcal{Q}$, for the special case where $\mathcal{Q}$ is the span of pre-specified indicator functions over disjoint sets. We also discuss the computational complexity of OCP and present computational results involving two illustrative examples.

Key words: Reinforcement Learning, Efficient Exploration, Value Function Generalization, Approximate Dynamic Programming

1. Introduction

A growing body of work on efficient reinforcement learning provides algorithms with guarantees on sample and computational efficiency (see, e.g., [13,6,2,30,4,9] and references therein). This literature highlights that an effective exploration scheme is critical to the design of any efficient reinforcement learning algorithm. In particular, popular exploration schemes such as $\epsilon$-greedy, Boltzmann, and knowledge gradient (see [27]) can require learning times that grow exponentially in the number of states and/or the planning horizon (see [38,29]).

The aforementioned literature focuses on tabula rasa learning; that is, the algorithms aim to learn with little or no prior knowledge about transition probabilities and rewards. Such algorithms require learning times that grow at least linearly with the number of states. Despite the valuable insights generated through their design and analysis, these algorithms are of limited practical import because state spaces in most contexts of practical interest are enormous. There is a need for algorithms that generalize from past experience in order to learn how to make effective decisions in reasonable time.

There has been much work on reinforcement learning algorithms that generalize (see, e.g., [5,31,32,24] and references therein). Most of these algorithms do not come with statistical or computational efficiency guarantees, though there are a few noteworthy exceptions, which we now discuss. A number of results treat policy-based algorithms (see [10,3] and references therein), in which the goal is to select high performers among a pre-specified collection of policies as learning progresses. Though interesting results have been produced in this line of work, each entails quite restrictive assumptions or does not make strong guarantees. Another body of work focuses on model-based algorithms. An algorithm proposed by Kearns and Koller [12] fits a factored model to observed data and makes decisions based on the fitted model. The authors establish a sample complexity bound that is polynomial in the number of model parameters rather than the numb...