Optimal transport (OT) theory can be informally described using the words of the French mathematician Gaspard Monge (1746-1818): A worker with a shovel in hand has to move a large pile of sand lying on a construction site. The goal of the worker is to erect with all that sand a target pile with a prescribed shape (for example, that of a giant sand castle). Naturally, the worker wishes to minimize her total effort, quantified for instance as the total distance or time spent carrying shovelfuls of sand. Mathematicians interested in OT cast that problem as that of comparing two probability distributions, that is, two different piles of sand of the same volume. They consider all of the many possible ways to morph, transport or reshape the first pile into the second, and associate a "global" cost to every such transport, using the "local" consideration of how much it costs to move a grain of sand from one place to another. Mathematicians are interested in the properties of that least costly transport, as well as in its efficient computation. That smallest cost not only defines a distance between distributions, but it also entails a rich geometric structure on the space of probability distributions. That structure is canonical in the sense that it borrows key geometric properties of the underlying "ground" space on which these distributions are defined. For instance, when the underlying space is Euclidean, key concepts such as interpolation, barycenters, convexity or gradients of functions extend naturally to the space of distributions endowed with an OT geometry.

OT has been (re)discovered in many settings and under different forms, giving it a rich history. While Monge's seminal work was motivated by an engineering problem, Tolstoi in the 1920s and Hitchcock, Kantorovich and Koopmans in the 1940s established its significance to logistics and economics. Dantzig solved it numerically in 1949 within the framework of linear programming, giving OT a firm footing in optimization.
OT was later revisited by analysts in the 1990s, notably Brenier, while also gaining fame in computer vision under the name of earth mover's distances. Recent years have witnessed yet another revolution in the spread of OT, thanks to the emergence of approximate solvers that can scale to large problem dimensions. As a consequence, OT is being increasingly used to unlock various problems in imaging sciences (such as color or texture processing), graphics (for shape manipulation) or machine learning (for regression, classification and generative modeling).

This paper reviews OT with a bias toward numerical methods, and covers the theoretical properties of OT that can guide the design of new algorithms. We focus in particular on the recent wave of efficient algorithms that have helped OT find relevance in data sciences. We give a prominent place to the many generalizations of OT that have been proposed in but a few years, and connect them with related approaches originating from statistical inference, kernel methods and information theory. All of the figures can...