Most policy search algorithms require thousands of training episodes to find an effective policy, which is often infeasible with a physical robot. This survey article focuses on the other extreme of the spectrum: how can a robot adapt with only a handful of trials (a dozen) and a few minutes? By analogy with the word "big-data", we refer to this challenge as "micro-data reinforcement learning". We show that a first strategy is to leverage prior knowledge on the policy structure (e.g., dynamic movement primitives), on the policy parameters (e.g., demonstrations), or on the dynamics (e.g., simulators). A second strategy is to create data-driven surrogate models of the expected reward (e.g., Bayesian optimization) or of the dynamics (e.g., model-based policy search), so that the policy optimizer queries the model instead of the real system. Overall, all successful micro-data algorithms combine these two strategies by varying the kind of model and prior knowledge. The current scientific challenges essentially revolve around scaling up to complex robots, designing generic priors, and optimizing the computing time.

1. In some rare cases, a process can be "optimally efficient".
2. It is challenging to put a precise limit on "micro-data learning", as each domain has different experimental constraints; this is why we refer in this article to "a few minutes" or "a few trials". The commonly used word "big-data" has a similarly "fuzzy" limit that depends on the exact domain.
3. Planning-based and model-predictive control methods [59] do not search for policy parameters, which is why they do not fit into the scope of this paper. Although trajectory-based policies and planning-based methods share the same goal, they usually search in a different space: planning algorithms search in the state-action space (e.g., joint positions/velocities), whereas policy methods search for the optimal parameters of the policy, which can encode a trajectory (e.g., with dynamic movement primitives).

This is basically sampling the distribution over trajectories, P(τ|θ), which is feasible since the sampling is performed with the models. When applying the same policy (i.e., a policy with the same parameters θ), the trajectories τ (and consequently the rewards r) can differ from one rollout to the next, since the system, the learned models, and/or the policy may be stochastic.
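To make the trajectory-sampling idea concrete, here is a minimal Python sketch of a Monte Carlo estimate of the expected return under a learned model: the policy is rolled out several times through the model (never on the real robot), and the resulting returns are averaged. The interfaces `model.sample_next`, `policy`, and `reward_fn` are illustrative assumptions for this sketch, not an API defined in the survey.

```python
import numpy as np

def rollout_return(model, policy, reward_fn, theta, s0, horizon, rng):
    """Sample one trajectory tau ~ P(tau | theta) with the learned model
    and return its cumulative reward."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s, theta)              # policy may be deterministic or stochastic
        s = model.sample_next(s, a, rng)  # sample s_{t+1} from the probabilistic model
        total += reward_fn(s, a)
    return total

def monte_carlo_return(model, policy, reward_fn, theta, s0, horizon,
                       n_samples=20, seed=0):
    """Monte Carlo estimate of the expected return J(theta): average the returns
    of n_samples trajectories sampled with the model instead of the real system."""
    rng = np.random.default_rng(seed)
    returns = [rollout_return(model, policy, reward_fn, theta, s0, horizon, rng)
               for _ in range(n_samples)]
    return np.mean(returns), np.std(returns)
```

Because each rollout samples a different τ from P(τ|θ), the standard deviation returned alongside the mean gives a rough picture of how much the stochasticity of the model and of the policy affects the estimated return.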