Normalizing flows are a widely used class of latent-variable generative models with a tractable likelihood. Affine-coupling models (Dinh et al., 2014, 2016) are a particularly common type of normalizing flow, for which the Jacobian of the latent-to-observable-variable transformation is triangular, allowing the likelihood to be computed in linear time. Despite the widespread use of affine couplings, the special structure of the architecture makes understanding their representational power challenging. The question of universal approximation was only recently resolved by three parallel papers (Huang et al., 2020; Zhang et al., 2020; Koehler et al., 2020), which showed that reasonably regular distributions can be approximated arbitrarily well using affine couplings, albeit with networks with a nearly-singular Jacobian. As ill-conditioned Jacobians are an obstacle for likelihood-based training, the fundamental question remains: which distributions can be approximated using well-conditioned affine coupling flows?

In this paper, we show that any log-concave distribution can be approximated using well-conditioned affine-coupling flows. In terms of proof techniques, we uncover and leverage deep connections between affine coupling architectures, underdamped Langevin dynamics (a stochastic differential equation often used to sample from Gibbs measures), and Hénon maps (a structured dynamical system that appears in the study of symplectic diffeomorphisms). Our results also inform the practice of training affine couplings: we approximate a padded version of the input distribution with i.i.d. Gaussians, a strategy which Koehler et al. (2020) empirically observed to result in better-conditioned flows, but which hitherto lacked theoretical grounding. Our proof can thus be seen as providing theoretical evidence for the benefits of Gaussian padding when training normalizing flows.

Definition 3.
We say that a symmetric matrix is positive semidefinite (PSD) if all of its eigenvalues are non-negative. For symmetric matrices $A, B$, we write $A \succeq B$ if and only if $A - B$ is PSD.

Definition 4. Given two probability measures $\mu, \nu$ over a metric space $(M, d)$, the Wasserstein-1 distance between them, denoted $W_1(\mu, \nu)$, is defined as
$$W_1(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{M \times M} d(x, y) \, d\gamma(x, y),$$
where $\Gamma(\mu, \nu)$ is the set of couplings, i.e. measures on $M \times M$ with marginals $\mu, \nu$ respectively. For two probability distributions $p, q$, we denote by $W_1(p, q)$ the Wasserstein-1 distance between their associated measures. In this paper, we set $M = \mathbb{R}^d$ and $d(x, y) = \|x - y\|_2$.

Definition 5. Given a distribution $q$ and a compact set $C$, we denote by $q|_C$ the distribution $q$ truncated to the set $C$. The truncated measure is defined as $q|_C(A) = \frac{1}{q(C)}\, q(A \cap C)$.
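To make the triangular-Jacobian property from the introduction concrete, the following is a minimal sketch of an affine coupling layer — not the paper's construction, with hypothetical toy functions `s` and `t` standing in for trained networks. Because the first block of coordinates passes through unchanged, the Jacobian is triangular and its log-determinant is just the sum of the scales, computable in linear time:

```python
import math

def s(x1):  # toy "scale" network (hypothetical stand-in)
    return [math.tanh(v) for v in x1]

def t(x1):  # toy "shift" network (hypothetical stand-in)
    return [2.0 * v + 0.1 for v in x1]

def coupling(x):
    """Affine coupling: y1 = x1, y2 = x2 * exp(s(x1)) + t(x1)."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    scale = s(x1)
    y2 = [b * math.exp(a) + c for a, b, c in zip(scale, x2, t(x1))]
    logdet = sum(scale)  # triangular Jacobian => linear-time log-determinant
    return x1 + y2, logdet

def coupling_inv(y):
    """Exact inverse: x2 = (y2 - t(y1)) * exp(-s(y1))."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    return y1 + [(b - c) * math.exp(-a) for a, b, c in zip(s(y1), y2, t(y1))]

x = [0.3, -1.2, 0.7, 2.0]
y, logdet = coupling(x)
assert all(abs(a - b) < 1e-12 for a, b in zip(coupling_inv(y), x))
```

The invertibility check at the end reflects why these layers compose into a flow: each layer is a diffeomorphism whose inverse and Jacobian determinant are both cheap.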
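As a concrete instance of Definition 4 (an illustration, not from the paper): in one dimension, for two uniform empirical measures with the same number of atoms, the infimum over couplings is attained by the monotone (sorted) pairing, so $W_1$ reduces to a mean absolute difference of order statistics:

```python
def wasserstein1_empirical(xs, ys):
    """W1 between uniform empirical measures on xs and ys (1-D case only)."""
    assert len(xs) == len(ys)
    # The optimal coupling in 1-D matches the i-th smallest atom of xs
    # with the i-th smallest atom of ys.
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

# Point masses at {0, 1} vs {0.5, 1.5}: each atom moves by 0.5, so W1 = 0.5.
print(wasserstein1_empirical([0.0, 1.0], [0.5, 1.5]))  # -> 0.5
```

In higher dimensions no such closed form exists and the infimum must be taken over all couplings, which is the setting of the main theorem below.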
Main result

Our main result states that we can approximate any log-concave distribution in Wasserstein-1 distance by a well-conditioned affine-coupling flow network. Precisely, we show: