Abstract. Motivated by the problem of estimating optimal feedback policy maps in stochastic control applications, we propose and analyze sequential design methods for ranking several response surfaces. Namely, given L ≥ 2 response surfaces over a continuous input space X, the aim is to efficiently find the index of the minimal response across the entire X. The response surfaces are not known and have to be noisily sampled one-at-a-time, requiring joint experimental design both in space and response-index dimensions. To generate sequential design heuristics we investigate Bayesian stepwise uncertainty reduction approaches, as well as sampling based on posterior classification complexity. We also make connections between our continuous-input formulation and the discrete framework of pure regret in multi-armed bandits. To model the response surfaces we utilize kriging metamodels. Several numerical examples using both synthetic data and an epidemics control problem are provided to illustrate our approach and the efficacy of respective adaptive designs.

Key words. sequential design, response surface modeling, stochastic kriging, sequential uncertainty reduction, expected improvement

1. Introduction. A central step in stochastic control problems concerns estimating expected costs-to-go that are used to approximate the optimal feedback control. In simulation approaches to this question, costs-to-go are sampled by generating trajectories of the stochastic system and then regressed against current system state. The resulting Q-values are finally ranked to find the action that minimizes expected costs.

When simulation is expensive, computational efficiency and experimental design become important. Sequential strategies rephrase learning the costs-to-go as another dynamic program, with actions corresponding to the sampling decisions. In this article, we explore a Bayesian formulation of this sequential design problem.
The ranking objective imposes a novel loss function which mixes classification and regression criteria. Moreover, the presence of multiple stochastic samplers (one for each possible action) and a continuous input space necessitates development of targeted response surface methodologies. In particular, a major innovation is modeling in parallel the spatial correlation within each Q-value, while utilizing a multi-armed bandit perspective for picking which sampler to call next.

To obtain a tractable approximation of the Q-values, we advocate the use of Gaussian process metamodels, viewing the latent response surfaces as realizations of a Gaussian random field. Consequently, the ranking criterion is formulated in terms of the posterior uncertainty about each Q-value. Thus, we connect metamodel uncertainty to the sampling decisions, akin to the discrete-state frameworks of ranking-and-selection and multi-armed bandits. Our work brings forth a new link between emulation of stochastic simulators and stochastic control, offering a new class of approximate dynamic programming algorithms.
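
The coupling between metamodel uncertainty and sampling decisions described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (`rbf_kernel`, `gp_posterior`, `next_sample`, `run_demo`), the squared-exponential kernel with fixed hyperparameters, the three toy surfaces on X = [0, 1], and in particular the acquisition rule, which is a simple standardized-gap heuristic standing in for the stepwise uncertainty reduction criteria studied later. Each surface gets its own independent Gaussian process; the next sample is placed where the two lowest posterior means are hardest to tell apart, on whichever of those two surfaces is currently more uncertain.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.3, variance=1.0):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=0.04):
    """Posterior mean and standard deviation of a zero-mean GP at x_test."""
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    Ks = rbf_kernel(x_test, x_train)
    chol = np.linalg.cholesky(K)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = Ks @ alpha
    v = np.linalg.solve(chol, Ks.T)
    var = rbf_kernel(x_test, x_test).diagonal() - np.sum(v ** 2, axis=0)
    return mean, np.sqrt(np.maximum(var, 1e-12))

def next_sample(means, sds):
    """Choose (grid index, surface index) with an illustrative gap heuristic.

    means, sds: arrays of shape (L, n_grid). The score is the gap between
    the two lowest posterior means divided by their pooled posterior
    standard deviation; a small score marks an ambiguous ranking.
    """
    order = np.argsort(means, axis=0)
    best, second = order[0], order[1]
    cols = np.arange(means.shape[1])
    gap = means[second, cols] - means[best, cols]
    pooled = np.sqrt(sds[best, cols] ** 2 + sds[second, cols] ** 2)
    j = int(np.argmin(gap / pooled))
    # Of the two contenders at x_j, sample the one with larger posterior sd.
    l = best[j] if sds[best[j], j] >= sds[second[j], j] else second[j]
    return j, int(l)

def run_demo(n_steps=60, noise_sd=0.2, seed=0):
    """Sequentially sample three toy surfaces and return the estimated ranking."""
    rng = np.random.default_rng(seed)
    # Hypothetical test surfaces; the true minimizer is surface 0 for x < 0.5
    # and surface 1 for x > 0.5.
    surfaces = [lambda x: x, lambda x: 1.0 - x, lambda x: 0.6 + 0.0 * x]
    grid = np.linspace(0.0, 1.0, 51)
    X = [np.linspace(0.0, 1.0, 5) for _ in surfaces]
    Y = [f(x) + noise_sd * rng.standard_normal(x.shape)
         for f, x in zip(surfaces, X)]

    def fit_all():
        post = [gp_posterior(x, y, grid, noise_var=noise_sd ** 2)
                for x, y in zip(X, Y)]
        return (np.array([m for m, _ in post]),
                np.array([s for _, s in post]))

    for _ in range(n_steps):
        means, sds = fit_all()
        j, l = next_sample(means, sds)
        y_new = surfaces[l](grid[j]) + noise_sd * rng.standard_normal()
        X[l] = np.append(X[l], grid[j])
        Y[l] = np.append(Y[l], y_new)

    means, _ = fit_all()
    return X, Y, means.argmin(axis=0)  # estimated minimal index on the grid

if __name__ == "__main__":
    X, Y, classifier = run_demo()
    print("samples per surface:", [len(x) for x in X])
    print("estimated minimal index on the grid:", classifier)
```

In this toy setting the design decision is two-dimensional, a location in X and a surface index, which mirrors the joint design in space and response-index dimensions mentioned in the abstract. The heuristic restricts candidates to a fixed grid and ignores hyperparameter estimation, both simplifications made only to keep the sketch short.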