We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and, as a solution, propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function $Q^*$ lies within a known hypothesis class $\mathcal{Q}$, OCP selects optimal actions over all but at most $\dim_E[\mathcal{Q}]$ episodes, where $\dim_E$ denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if $Q^*$ does not lie in $\mathcal{Q}$, for the special case where $\mathcal{Q}$ is the span of pre-specified indicator functions over disjoint sets. We also discuss the computational complexity of OCP and present computational results involving two illustrative examples.

Key words: Reinforcement Learning, Efficient Exploration, Value Function Generalization, Approximate Dynamic Programming

1. Introduction

A growing body of work on efficient reinforcement learning provides algorithms with guarantees on sample and computational efficiency (see, e.g., [13,6,2,30,4,9] and references therein). This literature highlights that an effective exploration scheme is critical to the design of any efficient reinforcement learning algorithm. In particular, popular exploration schemes such as $\epsilon$-greedy, Boltzmann, and knowledge gradient (see [27]) can require learning times that grow exponentially in the number of states and/or the planning horizon (see [38,29]).

The aforementioned literature focuses on tabula rasa learning; that is, the algorithms aim to learn with little or no prior knowledge about transition probabilities and rewards. Such algorithms require learning times that grow at least linearly with the number of states. Despite the valuable insights generated through their design and analysis, these algorithms are of limited practical import because state spaces in most contexts of practical interest are enormous. There is a need for algorithms that generalize from past experience in order to learn how to make effective decisions in reasonable time.

There has been much work on reinforcement learning algorithms that generalize (see, e.g., [5,31,32,24] and references therein). Most of these algorithms do not come with statistical or computational efficiency guarantees, though there are a few noteworthy exceptions, which we now discuss. A number of results treat policy-based algorithms (see [10,3] and references therein), in which the goal is to select high performers among a pre-specified collection of policies as learning progresses. Though interesting results have been produced in this line of work, each entails quite restrictive assumptions or does not make strong guarantees. Another body of work focuses on model-based algorithms. An algorithm proposed by Kearns and Koller [12] fits a factored model to observed data and makes decisions based on the fitted model. The authors establish a sample complexity bound that is polynomial in the number of model parameters rather than the numb...